Abstract

Analyzing big data in a highly dynamic environment is becoming increasingly
critical because of the growing need for end-to-end processing of this data.
Modern data flows are quite complex, and there are no efficient, cost-based,
fully automated, scalable optimization solutions that can assist flow
designers. The state-of-the-art proposals fail to provide near-optimal solutions
even for simple data flows. To tackle this problem, we introduce a set of
approximate algorithms for defining the execution order of the constituent
tasks, in order to minimize the total execution cost of a data flow. We also
present the advantages of the parallel execution of data flows. We validated
our proposals in both a real tool and synthetic flows, and the results show
that we can achieve significant speed-ups, moving much closer to optimal
solutions.

1. Introduction

Data analysis in a highly dynamic environment is becoming increasingly
critical in order to extract high-quality information from raw data that is
nowadays produced at an extreme scale. The ultimate goal is to derive
actionable information in a timely manner. To this end, we typically employ
fully automated data-centric flows (or simply data flows) both for
business intelligence [?] and scientific purposes [?], which typically execute
under demanding performance requirements, e.g., to complete in a few
seconds. Meeting such requirements, combined with the volatile nature of the
environment and the data, gives rise to the need for efficient optimization
techniques tailored to data flows.

Data flows define the processing of large data volumes as a sequence
of data manipulation tasks. An example of a real-world analytic flow is
one that processes free-form text data retrieved from Twitter (tweets) that
comment on products, in order to compose a dynamic report considering sales,
advertisement campaigns and user feedback after performing a dozen steps
[?]. Example steps include extracting date information, quantifying the
user sentiment through text analysis, filtering, grouping, and expanding the
information contained in the tweets through lookups in (static) data sources.
Another example is to process newspaper articles, perform linguistic analysis,
extract named entities and then establish relationships between companies
and persons [?]. The tasks in a flow can either have a direct correspondence
to relational operators, such as filters, grouping, aggregates and joins, or
encapsulate arbitrary data transformations, text analytics, machine learning
algorithms and so on [?].

One of the most important steps in data flow design is the
specification of the execution order of the constituent tasks. In practice, this is
usually the result of a manual procedure, which in many cases results in
non-optimal flow execution plans. Furthermore, even if a data flow is optimal for
a specific input data set, it may prove significantly suboptimal for another
data set with different characteristics [?]. We tackle this problem by
proposing optimization algorithms that can provide the optimal execution
order of the tasks in a data flow in an efficient manner and relieve the flow
designers from the burden of selecting the task ordering on their own. We
consider a single optimization objective, namely the minimization of the sum
of the task execution costs; we assume that the execution cost of each task
depends on the volume of data to be processed, which in turn depends on
the relative position of the task in the execution flow. The main challenges
in flow optimization that need to be addressed, and that differentiate the problem
from that of traditional query optimization, are as follows:

1. Not every task ordering is valid, which means that the optimization
algorithms need to respect the precedence constraints among tasks.
E.g., in the introductory example, we cannot move a task that
computes the average sentiment value from tweets before the task
that quantifies the sentiment of the user through text analysis.

2. Flows can be very large with many constituent tasks, e.g., up to one 
hundred. 


The main implication is that query optimization techniques, which
operate on plans with up to a few tens of operators that belong to the relational
algebra (according to which operator reordering is typically permitted), are
not applicable [8], [9]. Nevertheless, they are successful in their domain, and
this is the reason the data flow solutions proposed in this work are partially
inspired by query optimization, as we explain later. Overall, to date, there
are very few proposals that deal with (or are applicable to) task reordering
in data flows [10, ?, 16]. A common characteristic of these proposals is that
they are either too slow to find an exact solution in small flows, or they find
significantly suboptimal (approximate) solutions for bigger flows [10, ?].

In this work, we go beyond the state-of-the-art; we present both
approximate and exact solutions. The approximate solutions are applicable to large
flows and attain significantly better performance (more than 2 times speed-up
in some settings, whereas in isolated cases, the speed-up is two or three
orders of magnitude). The exact solution that we propose, although it cannot
scale in general, can process larger flows than those currently amenable to
exact optimization. Our solutions apply to flows comprising any type of tasks
and require as input common metadata that is task-independent, such as the
average task selectivity and the task cost per invocation (e.g., in time units).
Initially, we target linear flows, that is, flows that can be described as a chain
of activities with a single source and a single sink task; later, we relax this
assumption. The proposed optimization solutions were validated, as a proof
of concept, in a real environment, namely Pentaho Data Integration (PDI),
which is a widespread data flow tool [?]. Additionally, we performed
thorough evaluations against existing approaches for synthetic data flows. The
summary of our contributions is as follows:


1. We provide a case study of data flow optimization implemented in PDI
to provide insights into the inefficiency of the existing approaches and
the actual benefits of our approaches (Section 3).

2. We show that, under certain conditions, it is practical to derive optimal
linear flows, even when the number of tasks is relatively large. Contrary
to the case of query optimization, the most efficient solutions are those
that leverage algorithms enumerating valid topological orderings rather
than dynamic programming or backtracking techniques (Section 4).

3. We introduce novel approximate low-complexity algorithms that can
be used for task reordering in data flows that have the form of a chain
(Section 5).

4. We discuss algorithms that produce flow execution plans where a task
sends its output to several downstream tasks in parallel; such an
approach is suitable when the task selectivities are above 1, and can
further improve the performance of the flow execution plans (Section 6).

5. We show how we can extend the solutions mentioned above to non-linear
flows with an arbitrary number of sources and sinks (Section 7).

6. We conduct thorough experiments in synthetic flows to detect the best
optimization algorithm for linear and non-linear data flows among all
of our proposals (Section 8). The evaluation results show that the
approaches introduced here significantly and consistently outperform
the state-of-the-art in all our experiments.


An extended abstract of some of the ideas above appears in [?].

2. Problem Statement and Background 

In this paper, we deal with the problem of re-ordering the tasks of a data
flow without violating possible precedence constraints between tasks, so that
the performance of the flow is maximized. The data flow is represented as
a directed acyclic graph (DAG), where each task corresponds to a node in
the graph and the edges between nodes represent intermediate data
shipping among tasks; i.e., in data flows, the exchange of data between tasks is
explicitly represented through edges. The main notation, terminology and
assumptions are as follows:

• Let $G = (T, E)$ be a directed acyclic graph, where $T$ denotes the nodes
of the graph (that correspond to flow tasks) and $E$ represents the edges
(that correspond to the flow of data among the tasks). $G$ corresponds
to the execution plan of a data flow, since it defines the exact execution
order of the tasks.

• $T = \{t_1, \ldots, t_n\}$ is a set of tasks¹ of size $n$. Each flow task is responsible
for one or both of the following: (i) reading, retrieving or storing
data, and (ii) manipulating data.

• Let $E = \{edge_1, \ldots, edge_m\}$ be a set of edges of size $m$. Each edge
$edge_i$, $1 \le i \le m$, is an ordered pair $(t_j, t_k)$ denoting that task
$t_j$ sends data to task $t_k$. It holds that $m \le n(n-1)/2$; otherwise $G$ cannot be acyclic.

• Let $PC = (T', D)$ be another directed acyclic graph, where $T' \subseteq T$.
$D$ defines the precedence constraints (dependencies) that might exist
between pairs of tasks in $T'$. More formally, $D = \{d_1, \ldots, d_l\}$ is a set
of $l$ ordered pairs: $d_i = (t_j, t_k)$, $1 \le i \le l$, $1 \le j < k \le n$, where
each such pair denotes that $t_j$ must precede $t_k$ in any valid $G$. In
other words, $G$ should contain a path from $t_j$ to $t_k$. This implies that
if $D$ contains $(t_a, t_b)$ and $(t_b, t_c)$, it must also contain $(t_a, t_c)$. The $PC$
graph corresponds to a higher-level, non-executable view of a data flow,
where the exact ordering of tasks is not defined; only a partial ordering
is defined instead.

• Two execution plans $G_1$ and $G_2$ that respect all the precedence
constraints in $PC$ are termed logically equivalent flows.
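
The notation above can be made concrete with a small sketch. It is only an illustration under stated assumptions: tasks are plain strings, the execution plan G is restricted to a chain (a list of task names), and the helper names `is_valid_linear_plan` and `transitive_closure` are hypothetical, not part of the paper.

```python
from typing import Iterable, Sequence, Set, Tuple

Constraint = Tuple[str, str]  # (a, b): task a must precede task b


def is_valid_linear_plan(plan: Sequence[str],
                         constraints: Iterable[Constraint]) -> bool:
    """Return True if the chain `plan` respects every precedence constraint."""
    position = {task: idx for idx, task in enumerate(plan)}
    return all(position[a] < position[b] for a, b in constraints)


def transitive_closure(constraints: Iterable[Constraint]) -> Set[Constraint]:
    """Add (a, c) whenever (a, b) and (b, c) are present, as required of D."""
    closure = set(constraints)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical four-task flow: t1 -> t2 -> t3 are constrained, t4 is free.
D = transitive_closure({("t1", "t2"), ("t2", "t3")})
plan_a = ["t1", "t2", "t4", "t3"]
plan_b = ["t1", "t4", "t2", "t3"]
plan_c = ["t2", "t1", "t3", "t4"]  # violates (t1, t2)
```

Here `plan_a` and `plan_b` both respect D and are therefore logically equivalent in the sense defined above, whereas `plan_c` is not a valid plan.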

In this work, we initially focus on single-input single-output (SISO) flows.
A SISO data flow is defined as a flow G that contains only one task with no
incoming edges from another task and only one task with no outgoing edges.
The task with no incoming edges is termed the source task, and the task
with no outgoing edges is termed the sink task. In a SISO flow, there is
a dependency edge d from the source task to any other non-sink task, and
from all non-source tasks to the sink task.

Examples of SISO flows are given in Figure 1. In the figure, we can see
that a SISO flow can be executed both as a linear flow and as a parallel flow.
In linear physical flows, G has the form of a chain, and each non-source and
non-sink task has exactly one incoming and one outgoing edge. In parallel
physical flows, the output of a single task can be fed to multiple tasks in
parallel. The linear flow and the parallel flows in the figure are logically
equivalent flows. Because each SISO flow is logically equivalent to at least
one linear G, we call SISO flows logically (or conceptually) linear flows.

Figure 1: Examples of two logically equivalent parallel execution plans of a SISO linear
conceptual data flow.

¹ In the remainder of the paper, we will use the terms tasks, services and activities
interchangeably.

Each task $t_i$ is further described as a triple $t_i = \langle c_i, sel_i, inp_i \rangle$. In a
data flow, we assume that each task receives some data items as an input and
outputs some other data items as a result. Following the database
terminology, each data item is referred to as a tuple. The task elements are:


• Cost ($c_i$): we use $c_i = 1/r_i$, $1 \le i \le n$, as a metric of the time cost of
each task, where $r_i$ is the maximum rate at which results of invocations
can be obtained from the i-th task.

• Selectivity ($sel_i$): it denotes the average number of returned data items
per source tuple for the i-th service. For filtering operators, $sel_i < 1$;
for data sources and operators that just manipulate the input, $sel_i = 1$;
whereas, for operators that may produce more output records for each
input record, $sel_i > 1$.

• Input ($inp_i$): it denotes the size of the input of the i-th task $t_i$ in
number of tuples per source tuple. It depends on the product
of the selectivities of the preceding tasks in the execution plan².
More formally, if $T_i^{prec}$ is the set of all preceding tasks of $t_i$ in $G$,

$$inp_i = \prod_{j=1}^{|T_i^{prec}|} sel_j$$

• Output ($out_i$): the size of the output of the i-th task per source tuple
can be easily derived from the above quantities, as it is equal to $inp_i \cdot sel_i$.


From the above quantities, and assuming that selectivities are
independent, we can infer that $inp_i$ is the only task characteristic that depends on
the position of $t_i$ in $G$; the cost and the selectivity of each task are independent
of the exact $G$ that may include $t_i$.

² Here, there is an implicit assumption that the selectivities are independent; if this
is not the case, the product will be an arbitrarily erroneous approximation of the actual
selectivity of the subplan before each task.

Problem Statement: Given a set of tasks $T$ with known cost and
selectivity values, and a corresponding precedence constraint graph $PC$, we
aim to find a valid $G$ that minimizes the sum cost metric (SCM) per source
tuple, defined as follows: $SCM(G) = inp_1 c_1 + inp_2 c_2 + \cdots + inp_n c_n$. The optimal
plan is denoted as P.
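
As a minimal sketch of how SCM can be evaluated for a linear plan, the function below accumulates the input size as the running product of the selectivities of the preceding tasks. The function name `scm` and the dictionary layout are assumptions made for illustration; the metadata values reuse the three-task example discussed in Section 5.1.1.

```python
from typing import Dict, Sequence, Tuple


def scm(plan: Sequence[str], meta: Dict[str, Tuple[float, float]]) -> float:
    """Sum cost metric of a linear plan; meta[task] = (cost, selectivity)."""
    total = 0.0
    inp = 1.0  # the first task receives one tuple per source tuple
    for task in plan:
        cost, sel = meta[task]
        total += inp * cost  # contribution inp_i * c_i
        inp *= sel           # out_i = inp_i * sel_i feeds the next task
    return total


# Three inner tasks of cost 1 with selectivities 1, 1.1 and 0.5.
meta = {"t1": (1.0, 1.0), "t2": (1.0, 1.1), "t3": (1.0, 0.5)}
cost_initial = scm(["t1", "t2", "t3"], meta)  # 1 + 1 + 1.1
cost_better = scm(["t2", "t3", "t1"], meta)   # 1 + 1.1 + 0.55
```

Because the product of selectivities of a set of tasks does not depend on their order, only the position of each task relative to the others changes its contribution to the sum.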

Note that the input set of tuples is processed by all the tasks of the data
flow, but typically, some of the input tuple attributes may not be required by
every flow activity. According to [?], the unnecessary tuple attributes just
run through the flow, resembling an assembly-line model. The execution of a
flow activity is not affected by the unnecessary attributes. This implies that
the tasks of a flow can be reordered as long as the precedence
constraints between the tasks are preserved.


2.1. Problem Complexity 

In [?] it is proved that finding the optimal ordering of tasks is an
NP-hard problem when (i) each flow task is characterized by its cost per input
record and selectivity; (ii) the cost of each task is a linear function of the
number of records processed, and that number of records depends on the product
of the selectivities of all preceding tasks (assuming independence of
selectivities for simplicity); and (iii) the optimization criterion is the minimization
of the sum of the costs of all tasks. All the above conditions hold for our
case, so our problem is intractable. Moreover, in [?] it is discussed that “it
is unlikely that any polynomial time algorithm can approximate the optimal
plan to within a factor of $O(n^{\theta})$”, where $\theta$ is some positive constant. Note
that if we modify the optimization criterion, e.g., to optimize the bottleneck
cost metric or the critical path, the problem becomes tractable [16, 17].



3. Motivational Case Study in Kettle 

In this section, we present the application of data flows in a real-world
business tool, namely Pentaho Data Integration (PDI, or Kettle) [?], in
order to highlight the impact of optimization proposals on the performance
of flow execution. We introduce a data flow (Figure 2) that analyzes tags
referring to products, which are retrieved from tweets in Twitter, in order to
compose a dynamic report that associates sales with marketing campaigns.
In the following, we analyze the tasks of this data flow and of a variant of it,
together with details of the data set that the data flow processes for the
purposes of the case study.

As we observe, this data flow has a single streaming source that outputs
tweets on products, and the flow accesses four other static sources through
lookup operations. The initial streaming source task of the flow, called Tweets,
consists of 1,000,000 records of tweets with attributes such as product
references, coordinates, timestamps, etc. More specifically, the data flow is
described as follows. When a tweet arrives as a timestamped string attribute
(tag), the first task is to compute a single sentiment value in the range [-5, +5]
for the product mentioned in the tweet (Sentiment Analysis). Then, a lookup
operation that maps product references in the tweet is performed (Lookup
ProductID), and after this a filter is applied in order to choose products within
a specified range of product id values (Filter Products). The next task is also
a lookup task, which maps geographic information (latitude and longitude)
in the tweet to a geographical region (Lookup Region). Subsequently, the
task Extract Date from Timestamp converts the tweet timestamp to a date,
and then another filter is applied for choosing dates within a specific period of
time (Filter Dates). In order to implement the task SentimentAvg, where
the sentiment values are averaged over each region, product, and date, we
first have to sort the values of region, product, and date by applying the task
Sort Region, Product and Date. The flow continues with two more lookup
operations: the former maps the total sales of a product by region,
product and date (Lookup Total Sales), and the latter maps campaigns of
interest according to the results of total sales taken from the previous task
(Lookup Campaign). Finally, the user has the option to narrow down the
report in order to focus on a specific region with the filtering task Filter
Region.

Figure 2: A real-world analytic flow.

ID  Flow Task                       Cost (secs)  Selectivity
 1  Tweets (data source)                    1.7  1
 2  Sentiment Analysis                      4.5  1
 3  Lookup ProductID                        5    1
 4  Filter Products                         1.9  0.9
 5  Lookup Region                           6.5  1
 6  Extract Date from Timestamp            19.4  1
 7  Filter Dates                            2    0.2
 8  Sort Region, Product and Date         173    1
 9  SentimentAvg                           10.3  0.1
10  Lookup Total Sales                     10.8  1
11  Lookup Campaign                        11.6  1
12  Filter Region                           2    0.22
13  Report Output                           1    1

Table 1: The cost and selectivity values.

Additionally, there are four intermediate static sources, used as inputs in
lookup operations, whose cost is embedded in the cost of the lookup task that
takes the static records as input. The source
task Products has 100 records of product names and ids, while the Region
source task has 100 records, each holding a set of coordinates corresponding to a region
name. Another static source task, named Sales, consists of 4,000 sale details,
such as the sold product name, the price, the quantity, the region where the
product was sold, etc. The last static source task, named Campaigns,
has 500 campaign ids combined with the day on which each campaign begins,
the region where it will take place, and the product ids that each campaign
concerns.

Table 1 shows the selectivity and cost values computed for a specific
dataset of 1M records using a machine with an Intel Pentium G860 CPU
and 4 GB of RAM. We can observe that the most expensive tasks are the
grouping and lookup tasks, the cost of which is up to two orders of magnitude
higher than that of the less expensive ones. Also, there are three filtering tasks,
while the rest of the tasks do not modify the number of records (note that,
in general, selectivities may be higher than 1). In this data flow scenario,
the selectivity values of the lookup and transformation tasks are 1, while the
selectivity values of the filtering and grouping tasks vary.

Table 2 presents the precedence constraints between the tasks; the task
on the left of each arrow must precede the task on the right. This data flow
has 38% precedence constraints, as described in Table 2, where a fully
constrained flow with n tasks and 100% PCs has n(n-1)/2 constraints and no
equivalent ordering alternatives. In real data flow scenarios, the precedence
constraints are approximately 30% or even more, as in the flows
presented in [?].

Table 2: The precedence constraints of the data flow in Figure 2.

Precedence Constraints
Sentiment Analysis             ->  SentimentAvg
Lookup ProductID               ->  Filter Products
Lookup ProductID               ->  Sort Region, Product and Date
Lookup ProductID               ->  Lookup Total Sales
Lookup ProductID               ->  Lookup Campaign
Lookup Region                  ->  Sort Region, Product and Date
Lookup Region                  ->  Lookup Total Sales
Lookup Region                  ->  Lookup Campaign
Lookup Region                  ->  Filter Region
Extract Date from Timestamp    ->  Filter Dates
Extract Date from Timestamp    ->  Sort Region, Product and Date
Extract Date from Timestamp    ->  Lookup Total Sales
Extract Date from Timestamp    ->  Lookup Campaign
Sort Region, Product and Date  ->  SentimentAvg

A straightforward implementation is shown in Figure 2. Then, we applied the
best-performing approximate heuristic to date, which is proposed in [10].
The optimized plan is illustrated in Figure 3. In that case, the performance
improvement over the initial non-optimized flow is 42%, from 63 to 36.5
seconds.

Figure 3: The plan of the flow in Figure 2 as optimized by a heuristic algorithm
(optimized execution cost, approximate: 36.5 secs).

Figure 4: The plan of the flow in Figure 2 as optimized by an exhaustive algorithm
(optimized execution cost, accurate: 18.3 secs).

Similar to the previous optimization, we applied our exhaustive solution
to the flow of Figure 2 in order to find the optimal flow execution cost.
Figure 4 depicts the optimal plan of the initial data flow. In this
case, the exhaustive optimization methodology moves the filtering task
Filter Region, which in the initial design was placed at the end as a
final optional step, to the very beginning of this specific flow due to the
metadata in Table 1. A less obvious optimization is to move the pair of date
extraction and filtering tasks upstream, although the former is expensive and
non-filtering. The execution cost of this optimized plan is 18.3 seconds, which yields a
plan that is 3 times better than the initial non-optimized one. Both of the mentioned
optimization methodologies are analyzed in the following sections.

This is a representative example of a real, manually designed data flow
that exhibits significantly suboptimal behavior. In general, we can draw
two observations. Firstly, optimal solutions may yield execution costs that are
several times lower. A second, equally important observation is that even
in simple cases like the one examined here, existing heuristics may fail to
closely approximate the optimal solution and instead generate the plan in Figure 3.
The main reason in this example is that the approximate solution performs
greedy swaps of adjacent activities; however, the region filter cannot move
earlier unless the campaign lookup task is moved earlier as well, an action
that a greedy algorithm cannot cover.

4. Accurate Algorithms for Linear Execution Plans 

In this section, we present three accurate algorithms for reordering SISO
data flows in order to generate an optimal execution plan. The algorithms are
based on backtracking, dynamic programming and generation of all
topological sortings, respectively. Our main novelty here is that we examine a
topological sorting-based algorithm, despite its worst-case complexity.
Counterintuitively, as we show in the evaluation, the algorithm is practical even for
large n when there are many precedence constraints and, in general, can
scale better than the two other options. However, it still cannot be applied
to arbitrary flows of medium or large size.

4.1. Backtracking

The Backtracking algorithm finds all the possible execution plans
generated after reordering the tasks of a given data flow while preserving the precedence
constraints. The algorithm enumerates all the valid sub-flow plans by
applying a set of recursive calls on these sub-flows until all the
possible data flow plans have been generated. It backtracks when the placement of a task in a
specific position violates the precedence constraints. The algorithm is proposed
for flow optimization in [?].

Complexity: The worst-case time complexity of Backtracking is factorial
(i.e., $O(n!)$), since, if there are no dependencies, all orderings will be examined
in a brute-force manner.
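
The recursive enumeration can be sketched as follows. This is an illustrative reading of the description above rather than the referenced implementation: the function name `backtracking_optimal`, the (cost, selectivity) dictionary and the constraint pairs are assumptions, and no pruning beyond the precedence check is attempted.

```python
from typing import Dict, List, Set, Tuple


def backtracking_optimal(meta: Dict[str, Tuple[float, float]],
                         constraints: Set[Tuple[str, str]]):
    """Enumerate every valid linear plan and return (min SCM, plan).

    meta[task] = (cost, selectivity); (a, b) in constraints means a before b.
    """
    tasks = list(meta)
    preds = {t: {a for a, b in constraints if b == t} for t in tasks}
    best = [float("inf"), []]  # mutable holder for the best plan found so far

    def extend(plan: List[str], placed: Set[str], cost: float, inp: float):
        if len(plan) == len(tasks):
            if cost < best[0]:
                best[0], best[1] = cost, list(plan)
            return
        for t in tasks:
            # Skip t if already placed or if one of its prerequisites is missing.
            if t in placed or not preds[t] <= placed:
                continue
            c, sel = meta[t]
            plan.append(t)
            placed.add(t)
            extend(plan, placed, cost + inp * c, inp * sel)
            plan.pop()          # undo the placement and try the next task
            placed.remove(t)

    extend([], set(), 0.0, 1.0)
    return best[0], best[1]


# Three tasks of cost 1, selectivities 1, 1.1, 0.5, and t2 must precede t3.
example_meta = {"t1": (1.0, 1.0), "t2": (1.0, 1.1), "t3": (1.0, 0.5)}
best_cost, best_plan = backtracking_optimal(example_meta, {("t2", "t3")})
```

With no constraints the recursion visits all n! orderings, which is the factorial worst case noted above.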


4.2. Dynamic programming

This algorithm is extensively used as part of the System R-type of query
optimization to produce (linear) join orderings [?]. The rationale of the
dynamic programming algorithm (termed DP henceforth) for data flows
remains the same, that is, to calculate the cost of task subsets of size $n$
based on subsets of size $n - 1$. For each of these subsets, we keep only the
optimal solutions that are valid with regard to the precedence constraints.
Specifically, the DP algorithm considers each flow of size $n$ as a flow of $(n - 1)$
tasks followed by the $n$-th task; the key point is that the former part is the
optimal subset of size $n - 1$, which has been found in the previous step; then
the algorithm exhaustively examines which of the $n$ flow tasks is the one that,
when added at the end, yields an optimal subplan of size $n$. For example,
the algorithm starts by calculating subsets that consist of only one task, $\{t_1\}$,
then $\{t_2\}$, $\{t_3\}$ and so on. In a similar way, in the second step, it examines
subsets containing two tasks, i.e., $\{t_1, t_2\}$, $\{t_1, t_3\}$ and so on, until it examines
the complete flow $\{t_1, t_2, \ldots, t_n\}$. The number of the optimal (non-empty)
subsets of a flow is equal to $2^n - 1$. More details, along with pseudocode and
an example, are provided in Appendix A.

Complexity: The time complexity is $O(n^2 2^n)$. This is because we examine
all subsets of $n$ tasks, which are $O(2^n)$. For each subset, which is up to size
$O(n)$, we examine whether each element can be placed at the end of the
subplan. Each such check involves testing whether any of the remaining $n - 1$ tasks
violates a precedence constraint when placed before the $n$-th task. Overall,
for each element, we make $O(n)$ comparisons. So, the overall time complexity
is $O(2^n) O(n) O(n) = O(n^2 2^n)$. The space complexity is derived from the size of
the auxiliary data structures employed. We use three vectors of size $2^n - 1$,
as explained in Appendix A, one of which stores elements of size $O(n)$.
So the space complexity is $O(n 2^n)$.
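
A compact way to illustrate the subset-based recurrence is a bitmask formulation, sketched below. It is an assumption-laden illustration, not the pseudocode of Appendix A: subsets are encoded as integers, `preds[i]` is a hypothetical bitmask of the tasks that must precede task i, and the precedence graph is assumed to be acyclic. The sketch relies on the fact that the product of the selectivities of a subset does not depend on the order of its tasks.

```python
from typing import List, Tuple


def dp_optimal(costs: List[float], sels: List[float],
               preds: List[int]) -> Tuple[float, List[int]]:
    """Return (min SCM, plan as task indices) via DP over task subsets."""
    n = len(costs)
    size = 1 << n
    inf = float("inf")
    best = [inf] * size   # best[mask]: min SCM of a valid ordering of `mask`
    last = [-1] * size    # last[mask]: task placed last in that ordering
    prod = [1.0] * size   # prod[mask]: product of selectivities in `mask`
    best[0] = 0.0
    for mask in range(1, size):
        low = (mask & -mask).bit_length() - 1          # lowest task in mask
        prod[mask] = prod[mask & (mask - 1)] * sels[low]
        for t in range(n):
            bit = 1 << t
            if not mask & bit:
                continue
            rest = mask ^ bit
            # t may close the subplan only if all its prerequisites are in rest
            # and rest itself admits a valid ordering.
            if preds[t] & ~rest or best[rest] == inf:
                continue
            candidate = best[rest] + prod[rest] * costs[t]
            if candidate < best[mask]:
                best[mask], last[mask] = candidate, t
    plan, mask = [], size - 1
    while mask:                                        # rebuild the plan backwards
        plan.append(last[mask])
        mask ^= 1 << last[mask]
    return best[size - 1], plan[::-1]


# Tasks 0, 1, 2 with cost 1, selectivities 1, 1.1, 0.5; task 1 precedes task 2.
dp_cost, dp_plan = dp_optimal([1.0, 1.0, 1.0], [1.0, 1.1, 0.5], [0, 0, 0b010])
```

The three arrays of length 2^n mirror the exponential space requirement discussed above, which is what limits this approach to moderately sized flows.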

4.3. Topological sorting

The TopSort algorithm is a topological sorting algorithm based on [?],
which finds all the possible topological sortings given a partial ordering of a
finite set; in our case, the partial ordering is due to the precedence constraints.
The reason behind using this algorithm is that it (implicitly) prunes invalid
plans very efficiently, and it generates a new plan from a previous plan
after performing a minimal change. For the purposes of this work, we adapted
the topological sorting algorithm in order to generate all the possible
execution plans of a data flow and detect the execution plan with the minimum
cost. The algorithm assumes that it can receive as input a valid task
permutation $t_1 \rightarrow t_2 \rightarrow t_3 \rightarrow \ldots \rightarrow t_n$, which is trivial since it can be computed in
linear time. We generate all other valid execution plans by applying cyclic
rotations and swapping adjacent tasks.

Firstly, the process of generating all the valid flow execution plans begins
with the topological sorting of the $n - 1$ tasks $t_2 \rightarrow t_3 \rightarrow \ldots \rightarrow t_n$ of the
flow. Based on this partial sorting, we generate all the valid orderings of the
$t_1 \rightarrow t_2 \rightarrow t_3 \rightarrow \ldots \rightarrow t_n$ plan. Specifically, in the first stage of the algorithm,
the task $t_1$ is placed on the left part of the partial plan $t_2 \rightarrow t_3 \rightarrow \ldots \rightarrow t_n$, and
in the next steps of this stage, we swap it with the tasks on its right, while
the tasks of the partial plan maintain their relative position. The task $t_1$ stops
moving when such a swap violates a precedence constraint. Then, as the
task $t_1$ cannot be further transposed, the second stage of the algorithm begins
with a right-cyclic rotation of another partial plan consisting of $t_1$ and all the
tasks that precede it, that is, all the tasks that are positioned to its
left. In this way, $t_1$ is placed back at its initial position. Similarly, we generate all
the topological sortings of $t_2 \rightarrow t_3 \rightarrow \ldots \rightarrow t_n$, $t_3 \rightarrow t_4 \rightarrow \ldots \rightarrow t_n$ and so on.
For example, the topological sortings of the $t_3 \rightarrow t_4 \rightarrow \ldots \rightarrow t_n$ partial plan will
be generated with the transpositions of task $t_3$. For each generated plan, we
estimate the total execution cost and, finally, we choose the flow execution
plan with the best performance. Pseudocode, a brief example and further
details are given in Appendix B.



Complexity: Since the algorithm checks all the permutations, the time
complexity is $O(n!)$ in the worst case. However, compared to other algorithms
that produce all topological sortings, it is more efficient [19]. The space
complexity is $O(n)$ because only one plan is stored in main memory at any
point of execution.
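
The enumeration by adjacent swaps and cyclic rotations can be sketched as below. This is a hedged illustration and not the pseudocode of Appendix B: it follows the classic minimal-change scheme for generating all topological sortings, but in mirror image to the description above (it moves the highest-numbered task to the left instead of moving t1 to the right). It assumes that tasks are numbered 1..n so that the identity ordering is valid, i.e., every constraint (j, k) has j < k; the names `all_topological_orders` and `cheapest_plan` are hypothetical.

```python
from typing import Dict, Iterator, Set, Tuple


def all_topological_orders(n: int,
                           precedes: Set[Tuple[int, int]]) -> Iterator[Tuple[int, ...]]:
    """Yield every ordering of tasks 1..n that respects `precedes`."""
    a = list(range(n + 1))    # a[1..n] is the current plan; a[0] is a sentinel
    pos = list(range(n + 1))  # pos[task] = index of task in a
    while True:
        yield tuple(a[1:])
        k = n
        while k > 0:
            j = pos[k]
            left = a[j - 1]
            if left != 0 and (left, k) not in precedes:
                # Adjacent swap: move task k one position to the left.
                a[j - 1], a[j] = k, left
                pos[k], pos[left] = j - 1, j
                break
            # Task k is blocked: rotate it back to its home position k.
            while j < k:
                nxt = a[j + 1]
                a[j] = nxt
                pos[nxt] = j
                j += 1
            a[k] = k
            pos[k] = k
            k -= 1
        else:
            return  # no task can move any more: all orderings were produced


def cheapest_plan(n: int, precedes: Set[Tuple[int, int]],
                  cost: Dict[int, float], sel: Dict[int, float]):
    """Return (min SCM, plan) over all valid orderings."""
    best_cost, best_plan = float("inf"), None
    for plan in all_topological_orders(n, precedes):
        total, inp = 0.0, 1.0
        for t in plan:
            total += inp * cost[t]
            inp *= sel[t]
        if total < best_cost:
            best_cost, best_plan = total, plan
    return best_cost, best_plan


orders = list(all_topological_orders(3, {(2, 3)}))
ts_cost, ts_plan = cheapest_plan(3, {(2, 3)},
                                 {1: 1.0, 2: 1.0, 3: 1.0},
                                 {1: 1.0, 2: 1.1, 3: 0.5})
```

Only the current plan and its inverse index are kept in memory, which is consistent with the linear space requirement stated above; each new plan differs from the previous one by a single swap or rotation.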


5. Approximate Algorithms for Linear Execution Plans 

Due to the high complexity of the problem at hand, we need to develop
approximate solutions for the generic case. This section consists of two parts:
we first present existing solutions, including straightforward extensions of
existing proposals that are applicable to our problem, and then we present
our main novelty with regard to approximate optimization of linear data
flows. As will be shown in the evaluation, there is a significant gap in
performance between optimal solutions and existing approximate algorithms,
and our proposal fills that gap³.


5.1. Existing Solutions 

Here we present four algorithms, which reflect the current state-of-the-art
in task re-ordering in linear flows. Implementation details and examples are
provided in Appendix C.


5.1.1. Swap 

The Swap algorithm starts with a random valid execution plan. Such a
plan can be trivially computed in linear time through a single topological
ordering of PC. The algorithm then compares the cost of the existing execution
plan against the cost of the transformed plan obtained by swapping two adjacent tasks,
provided that the constraints are always satisfied. We perform this check for
every pair of adjacent tasks, and we repeat until no changes occur. Swap is
equivalent to the proposal in [?] when only task re-ordering is allowed. The
time complexity of the Swap algorithm is $O(n^2)$ because we can repeat at most
$n$ times, and each iteration has $O(n)$ complexity. The space complexity is
linear ($O(n)$), equal to the space needed to store a single plan.

In order to show that Swap is approximate, it suffices to provide 
at least one example in which the algorithm fails to yield the optimal solution. 
Assume a flow that has three inner tasks (i.e., tasks other than the source 
and sink ones), each with cost equal to 1 and selectivities 1, 1.1, and 0.5, 
respectively. There is also a precedence constraint between tasks 2 and 3. If 
the initial plan is $t_1 \rightarrow t_2 \rightarrow t_3$, then its 
SCM $= 1 + 1 + 1.1 = 3.1$. However, the optimal plan is 
$t_2 \rightarrow t_3 \rightarrow t_1$ with SCM $= 1 + 1.1 + 0.55 = 2.65$. 
Swap cannot produce that plan because it cannot perform transpositions 
that initially produce worse plans, but eventually lead to better solutions, 
such as the swap of tasks $t_1$ and $t_2$. 

^An initial introduction of the existing algorithms has appeared in 
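The three-task example above can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the helper names `scm`, `is_valid` and `swap` are ours, tasks are identified by integers, the source and sink tasks are omitted, and the input size of the first task is normalized to 1. For simplicity the SCM is recomputed from scratch for every candidate, which is more expensive than the incremental check assumed in the complexity analysis.

```python
from itertools import permutations

def scm(plan, cost, sel):
    """Sum cost metric: each task's cost weighted by the product of the
    selectivities of the tasks placed before it (first input taken as 1)."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * cost[t]
        inp *= sel[t]
    return total

def is_valid(plan, pcs):
    """True if every precedence constraint (a, b) places a before b."""
    pos = {t: i for i, t in enumerate(plan)}
    return all(pos[a] < pos[b] for a, b in pcs)

def swap(plan, cost, sel, pcs):
    """Keep swapping adjacent tasks while a valid swap lowers the SCM."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            cand = plan[:i] + [plan[i + 1], plan[i]] + plan[i + 2:]
            if is_valid(cand, pcs) and scm(cand, cost, sel) < scm(plan, cost, sel):
                plan, changed = cand, True
    return plan

cost = {1: 1.0, 2: 1.0, 3: 1.0}
sel = {1: 1.0, 2: 1.1, 3: 0.5}
pcs = [(2, 3)]  # t2 must precede t3

stuck = swap([1, 2, 3], cost, sel, pcs)
best = min((p for p in permutations(cost) if is_valid(p, pcs)),
           key=lambda p: scm(p, cost, sel))
print(stuck, round(scm(stuck, cost, sel), 2))      # [1, 2, 3] 3.1
print(list(best), round(scm(best, cost, sel), 2))  # [2, 3, 1] 2.65
```

Swap stays at the initial plan because the only valid adjacent swap raises the SCM to 3.2, while the exhaustive search over the three valid orderings finds the plan with SCM 2.65.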

5.1.2. GreedyI and GreedyII 

GreedyI starts with an empty plan and, in each step, it adds the activity 
with the maximum value of $(1 - sel_i)/c_i$, provided that it meets the 
precedence constraints. In the first step, the source task is chosen as the 
only eligible one. It bears similarities with the Chain algorithm in Q , although 
the latter algorithm was proposed for a different problem and appends the 
activity that minimizes $c_i$. The time complexity of the GreedyI algorithm is 
$O(n^2)$ because it consists of $n$ steps, where in each step $O(n)$ checks are 
performed to find the most efficient and valid task to append. With the help of 
appropriate data structures, the complexity can drop to $O(n \log n)$. 

Similarly to Swap, GreedyI may miss the optimal solution. For example, in the 
example with the three tasks of cost 1 and selectivities 1, 1.1 and 0.5, and a 
precedence constraint between $t_2$ and $t_3$, GreedyI will first append $t_1$, 
then $t_2$ and last $t_3$, which is not the best possible plan as explained earlier. 

Another greedy algorithm is GreedyII [21]. The rationale of GreedyII is 
similar to GreedyI apart from the fact that the construction of the optimized 
execution plan is right-to-left (i.e., from the sink to the source). 
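The left-to-right greedy construction can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `greedy_i` and the dictionary-based encoding of costs, selectivities and constraints are assumptions, the source and sink tasks are left out, and the constraint graph is assumed to be acyclic. The plain scan over eligible tasks corresponds to the $O(n)$ checks per step mentioned above.

```python
def greedy_i(tasks, cost, sel, pcs):
    """Append, at each step, the eligible task with the largest (1 - sel) / cost."""
    # Direct prerequisites of every task, derived from the constraint pairs.
    preds = {t: {a for a, b in pcs if b == t} for t in tasks}
    plan, placed = [], set()
    while len(plan) < len(tasks):
        # A task is eligible once all of its prerequisites are already placed.
        eligible = [t for t in tasks if t not in placed and preds[t] <= placed]
        nxt = max(eligible, key=lambda t: (1 - sel[t]) / cost[t])
        plan.append(nxt)
        placed.add(nxt)
    return plan

cost = {1: 1.0, 2: 1.0, 3: 1.0}
sel = {1: 1.0, 2: 1.1, 3: 0.5}
plan = greedy_i([1, 2, 3], cost, sel, pcs=[(2, 3)])
print(plan)  # [1, 2, 3]
```

In the first step only $t_1$ and $t_2$ are eligible, and $t_1$ wins with value 0 against -0.1, so the highly selective $t_3$ ends up last even though placing $t_2$ and $t_3$ first would be cheaper.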

5.1.3. Partition 

Partition forms clusters of activities by taking into consideration their 
eligibility. Specifically, each cluster consists of activities whose 
prerequisites have been considered in previous clusters. After building the 
clusters, each cluster is optimized separately by checking each permutation of 
cluster tasks. Similar to GreedyI, it was first proposed for data integration 
systems, and the details are given in 0. Partition runs in $O(n!)$ time in the 
worst case because, if there are no precedence constraints, it checks all 
permutations of a single cluster of size $n$. In general, its complexity is 
$O(k!)$, where $k$ is the size of the largest cluster, and thus it is 
inapplicable if a cluster contains more than a dozen tasks. As in the previous 
examples with the three tasks, it is easy to verify that Partition cannot find 
the optimal plan. In the first step, it forms a cluster with tasks $t_1$ and 
$t_2$ and decides to place $t_1$ before $t_2$ because it yields a better subplan. 
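The clustering step can be illustrated with the following Python sketch. It is written under our own assumptions: the names `partition` and `scm` are hypothetical, a cluster is taken to be the set of all tasks whose prerequisites lie in earlier clusters, and each cluster is optimized with an input size normalized to 1 (the actual input size scales all permutations of a cluster equally, so it does not change the choice).

```python
from itertools import permutations

def scm(plan, cost, sel):
    """Sum cost metric of a linear (sub)plan with unit input."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * cost[t]
        inp *= sel[t]
    return total

def partition(tasks, cost, sel, pcs):
    """Build clusters by eligibility and order each cluster exhaustively."""
    preds = {t: {a for a, b in pcs if b == t} for t in tasks}
    plan, placed = [], set()
    while len(plan) < len(tasks):
        # Tasks whose prerequisites were all handled in previous clusters.
        cluster = [t for t in tasks if t not in placed and preds[t] <= placed]
        # No constraint can hold between two members of the same cluster,
        # so every permutation of the cluster is valid.
        best = min(permutations(cluster), key=lambda p: scm(p, cost, sel))
        plan.extend(best)
        placed.update(cluster)
    return plan

cost = {1: 1.0, 2: 1.0, 3: 1.0}
sel = {1: 1.0, 2: 1.1, 3: 0.5}
plan = partition([1, 2, 3], cost, sel, pcs=[(2, 3)])
print(plan)  # [1, 2, 3]
```

The first cluster is {$t_1$, $t_2$} and the second is {$t_3$}; since the clusters are optimized in isolation, $t_3$ can never move ahead of $t_1$.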
Figure 5: Average (left) and maximum (right) improvements of exhaustive solutions 

Algorithm 1 Rank ordering based high-level algorithm 
Require: A set of n tasks, $T = \{t_1, \ldots, t_n\}$, and the PC graph 
Ensure: A directed acyclic graph P representing the optimal plan 
    Pre-processing phase 
    Apply KBZ algorithm 
    Post-processing phase 


5.2. Algorithms based on rank ordering 

The motivation behind our proposal is that the approximate solutions 
discussed previously deviate significantly from the optimal orderings. To show 
this, we conduct experiments with small flows, where applying an exhaustive 
technique to obtain the optimal plan is feasible. More specifically, in Figure 
5 (left), we examine 100 randomly generated data flows consisting of 15 tasks 
with cost $\in [1,100]$, $sel \in (0,2]$ and 20%-95% precedence constraints. The 
results show that the performance improvement derived by the application 
of an accurate algorithm is high; the TopSort algorithm can have up 
to 57% better performance improvement compared to a random initial flow 
that just respects the precedence constraints. In general, Swap seems to be 
the heuristic algorithm with the best performance improvement on average. 
In Figure 5 (right), the maximum normalized difference between the Swap and 
TopSort algorithms is presented. As we can observe, there are cases where 
the TopSort algorithm has 74% better performance improvement than the 
best heuristic. These findings highlight the need for proposing new 
approximate optimization methodologies, in order to provide more near-optimal 
flow execution plans. 
To fulfill this need, we propose a set of rank ordering-based approximate 
algorithms and we analyze them in this section. We build upon the join 
ordering algorithms proposed for query optimization in [22], [23], which will 
be referred to as KBZ. This algorithm leverages the rank value of each task, 
defined as $(1 - sel_i)/c_i$, and the dependencies among tasks. Our solutions 
can be described at a high level as shown in Algorithm 1. The main novelty is 
how to preprocess the flow, so that KBZ becomes applicable. Also, we 
post-process the result of the KBZ algorithm in order either to guarantee 
validity or to further improve the intermediate results. There are many options 
regarding how these two phases can be performed and here we present three 
concrete suggestions, which constitute the novelty of this section (examples 
are shown in Appendix D). 


5.2.1. KBZ 

The KBZ algorithm, which was proposed in [22], [23], is a seminal query 
optimization algorithm for join ordering. This algorithm considers only a 
specific form of precedence constraints, namely those representable as a 
rooted tree. The rationale of this algorithm is to order tasks according to 
their rank value. In the case that this is not possible due to the defined 
precedence constraints, the tasks are merged and the rank values are updated 
accordingly. The fact that the KBZ algorithm allows only tree-shaped precedence 
constraint graphs implies that there should be no task with more than one 
independent prerequisite activity; in such data flow scenarios, the percentage 
of precedence constraints is very low and decreases further with the number of 
tasks (e.g., less than 10% for a 100-node flow). Neither of these conditions 
occurs frequently in practice. The time complexity of the KBZ algorithm is 
$O(n^2)$. 
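The merging step can be made concrete with a small Python sketch. It is not the full KBZ algorithm: the `Task` class and the `merge` helper are hypothetical names, the rank is taken as $(1 - sel)/c$ as for GreedyI, and the composite cost and selectivity follow the usual sequential-composition rule (cost of the first task plus its selectivity times the cost of the second; product of the selectivities), which is our assumption rather than a formula quoted from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    cost: float
    sel: float

    @property
    def rank(self) -> float:
        return (1 - self.sel) / self.cost

def merge(a: Task, b: Task) -> Task:
    """Composite task that behaves like 'a immediately followed by b'."""
    return Task(f"{a.name}->{b.name}", a.cost + a.sel * b.cost, a.sel * b.sel)

t1 = Task("t1", 1.0, 1.0)
t2 = Task("t2", 1.0, 1.1)
t3 = Task("t3", 1.0, 0.5)

# t3 must follow t2 but has the higher rank, so the two cannot simply be
# sorted by rank; they are merged and ranked as a single unit instead.
assert t3.rank > t2.rank
t23 = merge(t2, t3)
order = sorted([t1, t23], key=lambda t: t.rank, reverse=True)
print([t.name for t in order])  # ['t2->t3', 't1']
```

On the running three-task example the composite task has cost 2.1 and selectivity 0.55, hence a positive rank, and is placed before $t_1$; this is the ordering with SCM 2.65 that Swap, GreedyI and Partition miss.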


5.2.2. RO-I 

In our first proposal, called RO-I, the pre-processing phase ensures the 
transformation of the PC graph into a tree-shaped one. This is done by 
removing, for every task that has more than one incoming edge, all incoming 
edges except the one with the maximum rank. This allows KBZ to run but may 
produce invalid flow orderings. To fix that, we employ a post-processing phase 
where any resulting PC violations are resolved by moving tasks upstream if 
they are needed as prerequisites for other tasks placed earlier. 

Figure 6: A merging example. 

The worst-case complexity of the pre-processing phase is $O(n^2)$ because 
we remove up to $n - 1$ incoming edges ($O(n)$ complexity) from each task and 
we repeat this for $n - 1$ tasks of the flow. Additionally, in the 
post-processing step, we check, for each of the $n$ tasks, if any of the 
preceding tasks violates the precedence constraints. There can be up to 
$n - 1$ preceding tasks in a flow ordering. So, in the worst case, the 
complexity is $O(n^2)$. However, in practice the average time complexity is 
much lower for both phases. 
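The two RO-I phases around KBZ can be sketched as follows. This is a hedged illustration with several assumptions of our own: the function names are hypothetical, the "rank of an incoming edge" is interpreted as the rank of the prerequisite task at its tail, the KBZ call itself is not shown (its output is replaced by a hand-written ordering), and the repair step pulls missing prerequisites directly in front of the first task that needs them.

```python
def keep_max_rank_edges(pcs, rank):
    """Pre-processing: leave each task with at most one incoming edge,
    namely the one whose prerequisite has the maximum rank (assumption)."""
    preds = {}
    for a, b in pcs:
        preds.setdefault(b, []).append(a)
    return [(max(ps, key=lambda a: rank[a]), b) for b, ps in preds.items()]

def repair(order, pcs):
    """Post-processing: move prerequisites upstream until the ordering
    satisfies all original constraints (pcs is assumed to be acyclic)."""
    preds = {t: [a for a, b in pcs if b == t] for t in order}
    pos = {t: i for i, t in enumerate(order)}
    fixed, seen = [], set()

    def place(t):
        if t in seen:
            return
        seen.add(t)
        # Emit missing prerequisites first, in their current relative order.
        for p in sorted(preds[t], key=pos.get):
            place(p)
        fixed.append(t)

    for t in order:
        place(t)
    return fixed

pcs = [(1, 3), (2, 3), (3, 4), (1, 4)]
rank = {1: 0.1, 2: 0.5, 3: 0.2, 4: 0.3}

tree_pcs = keep_max_rank_edges(pcs, rank)
print(tree_pcs)                  # [(2, 3), (3, 4)]

kbz_order = [2, 3, 4, 1]         # hypothetical KBZ output on the tree
print(repair(kbz_order, pcs))    # [2, 1, 3, 4]
```

The hypothetical ordering respects the reduced constraints but violates the dropped ones ($t_1$ before $t_3$ and $t_4$); the repair step restores validity by moving $t_1$ upstream.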


5.2.3. RO-II 

The RO-II algorithm follows a different approach in order to render KBZ 
applicable. In the pre-processing phase, this approximate algorithm first 
detects paths in the precedence constraint graph that share an intermediate 
source and sink. Then it merges them into a single path based on their rank 
values. When there are multiple such paths, we start merging from the most 
upstream ones, and when there are nested paths, we start merging from the 
innermost ones. In that way, all precedence constraints are preserved at the 
expense of implicitly examining fewer re-orderings. An example is shown 
in Figure 6. In that example, after the merging procedure we enforce more 
precedence constraints than the original ones, so that a task must now precede 
not only the task it originally had to precede but also additional tasks. In 
other words, the merging process imposes more restrictions on the possible 
re-orderings. As such, these local optimizations may still deviate from a 
globally optimal solution significantly in the average case. RO-II does not 
require any post-processing because its result is always valid. 

RO-II, as will be shown in the evaluation section, in general behaves 
better than RO-I. However, in some cases RO-II's performance is much worse; 
an example is given in Appendix D. Also, in the case of RO-II, the time 
complexity remains $O(n^2)$ because for each merge process we consider at 
most $O(n)$ flow tasks and we repeat this for all the possible merge processes, 
which can be up to $n$. 
Algorithm 2 RO-III 
Require: A set of n tasks, $T = \{t_1, \ldots, t_n\}$ 
         A directed acyclic graph PC with precedence constraints 
         Optimized plan P as a directed acyclic graph returned by RO-II 
Ensure: A directed acyclic graph P representing the optimal plan 
repeat 
    {k is the maximum subplan size considered} 
    for i = 1:k do 
        for s = 1:n-i do 
            for t = s+i:n do 
                consider moving the subplan of size i starting from the 
                s-th task after the t-th task 
            end for 
        end for 
    end for 
until no changes applied 


5.2.4. RO-III 

After the evaluation of the proposed RO-I and RO-II algorithms, we 
isolated data flow cases that were not near-optimal. For example, RO-II 
was not able to move a filtering task to an earlier stage of the flow, even 
when this was not restricted by precedence constraints, in order to reduce the 
data that the flow will process. To fill this gap, we propose RO-III to support 
the efficient optimization of such data flow cases. The RO-III algorithm, 
presented in Algorithm 2, tackles the limitations of RO-II with the help of 
a post-processing phase that we introduce. Specifically, we apply the RO-II 
algorithm in order to produce an intermediate execution plan, and then we 
examine several transpositions. More specifically, we check all the possible 
transpositions of each sub-flow of size from 1 to $k$ tasks in the plan. The 
checks are applied from left to right. In this way, we address the 
problem of a task being “trapped” in a suboptimal place upstream in the 
flow execution due to the additional implicit constraints introduced by RO-II 
(see the transposition of $t_j$ in Figure D.24 in Appendix D). This process 
is described by the three nested for loops in Algorithm 2 and is repeated until 
there are no changes in the flow plan. We repeat it because 
each applied transposition may enable further valid transpositions that were 
not initially possible. 
Figure 7: An example of flow parallel execution. 

The post-processing phase of the RO-III algorithm has $O(kn^2)$ complexity, 
which is derived from the maximum number of iterations of each of the three 
nested loops. The repeat process in theory can execute up to $n$ times, but 
in practice, even for large flows, there is no change after 3 times. In all 
experiments, we set $k$ to 5. 
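The three nested loops of the post-processing phase can be sketched in Python as follows. The sketch makes several assumptions: the function names are ours, indices are 0-based, a move is applied as soon as it is valid and strictly lowers the SCM, and the SCM is recomputed from scratch for each candidate (so one pass of this naive version costs more than the $O(kn^2)$ bound, which presumes constant-time evaluation of a move). The intermediate plan would come from RO-II; here a hand-written plan is used instead.

```python
def scm(plan, cost, sel):
    """Sum cost metric of a linear plan with unit input."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * cost[t]
        inp *= sel[t]
    return total

def is_valid(plan, pcs):
    pos = {t: i for i, t in enumerate(plan)}
    return all(pos[a] < pos[b] for a, b in pcs)

def ro3_postprocess(plan, cost, sel, pcs, k=5):
    """Move sub-plans of size 1..k downstream while this lowers the SCM."""
    plan = list(plan)
    n = len(plan)
    changed = True
    while changed:                           # repeat ... until no changes
        changed = False
        for i in range(1, k + 1):            # sub-plan size
            for s in range(0, n - i):        # start position of the sub-plan
                for t in range(s + i, n):    # position to move it after
                    block = plan[s:s + i]
                    cand = plan[:s] + plan[s + i:t + 1] + block + plan[t + 1:]
                    if is_valid(cand, pcs) and scm(cand, cost, sel) < scm(plan, cost, sel):
                        plan, changed = cand, True
    return plan

cost = {1: 1.0, 2: 1.0, 3: 1.0}
sel = {1: 1.0, 2: 1.1, 3: 0.5}
result = ro3_postprocess([1, 2, 3], cost, sel, pcs=[(2, 3)])
print(result)  # [2, 3, 1]
```

Each accepted move strictly decreases the SCM, so the loop terminates. On the running example, moving $t_1$ after $t_3$ in a single step reaches the plan with SCM 2.65, which adjacent swaps alone cannot reach.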

6. Parallel Optimization Solutions 

This section focuses on the advantages of parallel execution plans. As 
explained earlier, in a parallel physical flow each single task 
can have multiple outgoing edges, which implies that the output of such a 
task is fed, as input, to multiple tasks. In the right part of Figure 7 we 
observe that a single task may have not only multiple outgoing edges, but 
also multiple incoming edges. In this case, a single task receives as input 
the output of multiple tasks. This is in line with the AND-Join workflow 
pattern as presented in j^, where the outgoing edges of multiple tasks that 
are executed in parallel converge into a single task. 

This case can be considered as a merge-split process, which in software 
tools such as PDI can be implemented by incorporating a merge join process. 
As such, merging multiple input streams incurs an extra execution cost. To 
assess this cost, we evaluated parallel data flows that were executed with the 
PDI tool. The conclusion was that the merge task cost has a small effect on 
the total flow execution cost; in other words, the merge task is similar to an 
additional lightweight activity. Additionally, the size of the input ($inp_i$) 
of a task $t_i$ that receives more than one incoming edge is defined similarly 
to the tasks with only one incoming edge, i.e., by computing the product of 
the selectivity values of the preceding tasks, as described earlier. 
Let us now analyze when the parallel flow execution may be beneficial 
through a theoretical example. Let us consider two subsequent tasks $t_3$ and 
$t_4$ illustrated in Figure 7, which do not have precedence constraints between 
them, and an extra cost of the merge process that will be denoted as $mc$. 
In this figure, we show two alternative plans, a linear one (in the middle) 
and a parallel one (on the right). The SCM values of the two alternatives 
vary only with respect to activities $t_4$ and $t_5$. We distinguish between the 
following four cases (using a superscript to differentiate the inputs in the two 
cases): 

• Case I: $sel_3 < 1$ and $sel_4 < 1$. The linear execution cost is lower 
than the parallel execution cost, because (i) 
$inp_4^{linear} c_4 < inp_4^{parallel} c_4$, as 
$inp_4^{linear} = sel_3 \, inp_4^{parallel}$ and $sel_3 < 1$, and (ii) 
$inp_5^{linear} c_5 < inp_5^{parallel} (c_5 + mc)$ due to the extra merge cost 
of the parallel version and given that $inp_5^{linear} = inp_5^{parallel}$. 
So, in that case, parallelism is not beneficial. 

• Case II: $sel_3 < 1$ and $sel_4 > 1$. Similarly to Case I, the linear 
execution of the flow is more beneficial than the parallel one; note that the 
selectivity value $sel_4$ does not affect the previous statements. 

• Case III: $sel_3 > 1$ and $sel_4 > 1$. If $mc = 0$, the parallel execution 
results in better performance than the linear execution. In that case 
$inp_5^{linear} c_5 = inp_5^{parallel} (c_5 + mc)$. Because $sel_3 > 1$, we 
deduce that $inp_4^{linear} c_4 > inp_4^{parallel} c_4$. In the generic case 
where $mc > 0$, we need to compute the estimated costs in order to verify which 
option is more beneficial, but we expect the parallel execution to outperform 
the linear one for small $mc$ values. 

• Case IV: $sel_3 > 1$ and $sel_4 < 1$. Following the rationale of the previous 
case, there is no clear winner between the two executions shown in 
Figure 7. However, an optimized linear plan will place $t_4$ before $t_3$, thus 
corresponding to Case II, where the (new) linear plan is better than the 
parallel one. 
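The four cases can be verified numerically with the short Python sketch below. The two cost functions encode our reading of the analysis above and are assumptions rather than formulas quoted from the text: in the linear plan $t_4$ receives the output of $t_3$, in the parallel plan $t_3$ and $t_4$ receive the same input, the input of $t_5$ is the product of all preceding selectivities in both plans, and the merge cost $mc$ is charged per input tuple of $t_5$. All costs are set to 10 purely for illustration.

```python
def linear_cost(inp3, c3, sel3, c4, sel4, c5):
    """t3 -> t4 -> t5: every task sees the output of the tasks before it."""
    return inp3 * (c3 + sel3 * c4 + sel3 * sel4 * c5)

def parallel_cost(inp3, c3, sel3, c4, sel4, c5, mc=0.0):
    """t3 and t4 read the same input; t5 additionally pays the merge cost."""
    return inp3 * (c3 + c4 + sel3 * sel4 * (c5 + mc))

cases = {"I": (0.5, 0.5), "II": (0.5, 1.5), "III": (1.5, 1.5), "IV": (1.5, 0.5)}
for name, (sel3, sel4) in cases.items():
    lin = linear_cost(1.0, 10.0, sel3, 10.0, sel4, 10.0)
    par = parallel_cost(1.0, 10.0, sel3, 10.0, sel4, 10.0, mc=0.0)
    print(name, lin, par, "parallel" if par < lin else "linear")

# Case IV revisited: re-ordering the linear plan (t4 before t3) beats the
# parallel plan, in line with the argument given for that case.
reordered = linear_cost(1.0, 10.0, 0.5, 10.0, 1.5, 10.0)
```

Under these assumptions the difference between the two plans is $inp_3 \, ((sel_3 - 1) c_4 - sel_3 \, sel_4 \, mc)$, so the parallel plan can only win when $sel_3 > 1$ and the merge cost is small enough.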


As in the previous section, we first describe a simple extension to an 
existing solution for ordering web services described in [1^ . Then we propose a 
novel post-processing step that applies to any of the solutions in the previous 
section in order to turn their output plans into parallel ones. Our solution 
leverages and generalizes the analysis above, and based on the findings of 
Case III, it parallelizes tasks with selectivity higher than 1. 
Figure 8: Example of executing SISO flows in parallel. (Panels: an optimized 
SISO execution plan; a parallel optimized execution plan.) 

6.1. PGreedyI and PGreedyII 

The PGreedyI optimization algorithm has the distinctive feature of 
generating parallel flow execution plans. The rationale of PGreedyI is to 
order the flow tasks in such a way that the amount of data received 
by the tasks with selectivity value > 1 is reduced, by pushing the selective 
flow tasks (filtering tasks) to an earlier stage of the flow to prune the input 
dataset. Based on the selectivity values, the optimal execution plan may 
dispatch the output of a task to multiple other tasks in parallel, or place 
them in a sequence. Specifically, the flow tasks having selectivity value > 1 

A weak point of PGreedyI is that, in each step, it tries to find the task 
that has the minimum cost without considering the implications for the next 
tasks (e.g., due to high selectivity). The second flavour, PGreedyII, chooses 
not the activity with the lowest cost but the activity with the highest rank 
value; in this way we penalize tasks that have low cost but high selectivity, 
which can yield lower SCM values for the overall plan. Both algorithms have 
worst-case time complexity that is polynomial in $n$, as explained in [16]. 

6.2. Executing SISO flows in parallel 
Algorithm 3 Post-process step for parallel SISO flows 
Require: An optimized linear plan $P = \{t_1 \rightarrow \ldots \rightarrow t_n\}$ 
         A directed acyclic graph PC with precedence constraints 
Ensure: A directed acyclic graph P representing the optimal parallel plan 
i = 1 
while i < n do 
    j = i + 1 
    while $sel_{p(j)} > 1$ do 
        Delete the edge between the tasks $t_{p(j-1)}$ and $t_{p(j)}$ from P 
        if no task in $t_{p(i+1)} \ldots t_{p(j-1)}$ is a predecessor of $t_{p(j)}$ in PC then 
            Connect the tasks (i) $t_{p(i)}$ and (ii) $t_{p(j)}$, i.e., 
            create the edge $t_{p(i)} \rightarrow t_{p(j)}$ in P 
        else 
            Connect in P (i) all the preceding tasks in PC 
            with no outgoing edges in P and (ii) $t_{p(j)}$ 
        end if 
        j = j + 1 
    end while 
    Connect in P (i) all the tasks $t_{p(i+1)} \ldots t_{p(j-1)}$ with 
    no outgoing edges in P and (ii) $t_{p(j)}$ 
    i = j 
end while 


In order to exploit the advantages of the proposed optimization techniques, 
in Algorithm 3 we introduce a post-process phase for executing data 
flows in parallel. To this end, after the generation of an optimized linear 
execution plan, we apply a post-process step that restructures the flow in a 
way that subsequent tasks having selectivity greater than 1 are executed 
in parallel if this does not incur violations of the precedence constraints. 
This post-process step can be applied to any optimization algorithm that 
produces a linear ordering. 

An example is presented in Figure 8, where in the upper flow scenario 
we choose to parallelize the tasks $t_2$, $t_3$ and $t_4$, while in the flow 
case that is depicted at the bottom of the figure, we execute in parallel only 
the tasks $t_2$ and $t_3$ and not $t_4$, because of the precedence constraints. 
Then, is appended after $t_2$ because of the constraints and is executed in 
parallel with t^. As the task tg has selectivity value < 1, it is not executed 
in parallel with any other task. 

The complexity is $O(n^2)$. The parallelization of each task is examined at 
most once, and in each such case the preceding tasks need to be checked, the 
number of which cannot exceed $n$. 

Figure 9: Example MIMO data flows of type butterfly (left) and fork (right). 

7. Extensions to MIMO flows 


Algorithm 4 Optimization of MIMO flows 
repeat 
    Extract SISO segments 
    for all SISO segments do 
        Optimize the SISO segment 
    end for 
    Apply factorize/distribute optimization, thus modifying the SISO segments 
until no changes 


So far we have discussed the case with a single source and a single sink 
task, but arbitrary multiple-input multiple-output (MIMO) flows can benefit 
from the solutions presented in the previous sections. The generic types of 
MIMO flows are described in j^, two of which are shown in Figure 9. A 
main difference between SISO and MIMO flows is that apart from re-ordering 
tasks, additional optimization operations can apply. As explained in jl^, the 
factorize and distribute operations can move an activity appearing in both 
input subflows of a binary activity to its output and the other way around, 
respectively. This allows, for example, a filtering operation initially placed 
after a merge task to be pushed down to the merge inputs (provided that 
the filtering condition refers to both inputs), which is known to yield better 
performance. 

'^(l^ additionally considers the case that an activity can be further split in several 
sub-activities, which is not considered here. 

As we can see in Figure 9, the MIMO flows consist of linear sub-flows. 
Therefore, the optimization of SISO data flows can play an important role 
in optimizing MIMO flows. Algorithm 4 describes a proposal for optimizing 
MIMO flows, which is based on extracting the linear segments of the 
flow and applying optimization algorithms only to the SISO sub-flows. Then, 
we check whether we can apply the factorize/distribute operations, which 
modify the linear segments. This process is repeated until it converges. In 
this work, we focus solely on task re-ordering (which corresponds to optimizing 
the linear segments individually), and the investigation of further techniques 
that combine task re-orderings with additional operations is left for future 
work. 

8. Experimental Analysis 

In this section we present a set of experiments, which have been conducted 
in order to evaluate the following two factors: 

• Performance optimization, which corresponds to the minimization of 
the estimated flow execution cost SCM. The performance improvements 
are measured as the percentage of the decrease in SCM after optimization. 

• Time overhead, in terms of the real time that the generation of the 
optimized execution plan requires. 

We construct synthetic flows so that we thoroughly evaluate the algorithms 
in a wide range of parameter combinations and derive 
unbiased and generically applicable lessons for the behaviour of each 
algorithm. The main configurable parameters are three: (i) the number of tasks 
$n$, ranging from 10 up to 100 (without including the source and the sink 
tasks), thus covering a range from small to very large data flows; (ii) the cost 
and selectivity values of the flow tasks, which are distributed in the range 
of [1,100] and (0,2], respectively (following either the uniform or the beta 
distribution); and (iii) the number of precedence constraints between the 
flow tasks; in general we consider cases where a fraction $\alpha$ of the 
possible precedence constraints is present, where $\alpha \in [0.1, 0.98]$. The 
larger the $\alpha$ value, the fewer the opportunities for optimization. For 
small $\alpha$ values, there are few PCs, which implies 
the existence of several valid re-orderings. However, when $\alpha$ becomes 0, the 
problem reduces to filter ordering in database queries without precedence 
constraints and thus is out of our interest. Remember that in real cases, we 
expect PCs to be above 30%. 
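A synthetic flow with these parameters could be generated along the following lines. This Python sketch is not the simulation environment used in the experiments; it rests on assumptions of our own: the function name `random_flow` is hypothetical, the percentage of constraints is interpreted relative to the $n(n-1)/2$ task pairs that are consistent with a random reference ordering, constraints implied by transitivity are not counted separately, and the beta parameters are fixed to 0.5.

```python
import random
from itertools import combinations

def random_flow(n, alpha, dist="uniform", seed=None):
    """Return (cost, sel, pcs, order) for a random flow with n inner tasks."""
    rng = random.Random(seed)

    def unit():
        """Draw a value in (0, 1] from the chosen distribution."""
        if dist == "beta":
            return min(max(rng.betavariate(0.5, 0.5), 1e-12), 1.0)
        return 1.0 - rng.random()

    cost = {t: 1.0 + 99.0 * unit() for t in range(n)}  # within [1, 100]
    sel = {t: 2.0 * unit() for t in range(n)}          # within (0, 2]

    # Constraints are sampled from the pairs of a random reference ordering,
    # so the constraint graph is acyclic by construction and the reference
    # ordering itself is a valid initial plan.
    order = list(range(n))
    rng.shuffle(order)
    pairs = list(combinations(order, 2))
    pcs = rng.sample(pairs, round(alpha * len(pairs)))
    return cost, sel, pcs, order

cost, sel, pcs, order = random_flow(10, 0.4, seed=1)
print(len(pcs))  # 18 of the 45 possible pairs
```

The returned `order` can serve as the random initial plan that merely respects the precedence constraints, against which the optimized plans are normalized.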

In order to conduct the experiments, we randomly generate PC DAGs 
and task characteristics in a simulation environment. Unless otherwise men¬ 
tioned, every experiment is repeated 100 times and the average values are 
presented. When discussing real times, we use a machine with an Intel Core 
i5 660 CPU and 6 GB of RAM. 

8.1. Performance Improvements 

At the beginning of Section 5.2 we presented the significant gap between 
the best performing heuristic to date, namely Swap, and the accurate 
solutions for small flows. We extrapolate that this gap remains, if not widens, 
for larger flows. The main purpose of this part is to show how the rank 
ordering-based solutions are capable of filling this gap, and then we discuss 
the performance benefits due to parallelism in SISO flows. Finally, we 
evaluate the proposals for MIMO flows. 


8.1.1. Performance of Rank Ordering-based Solutions 

Figure 10 presents the results of the comparison of rank ordering-based 
optimization methodologies with the initial flow execution plan and Swap. 

The values of the results are normalized according to the performance of 
the initial (random) execution plan. The four sub-figures present the 
performance improvement of each optimization proposal for PCs = 20%, 40%, 
60%, 80%, respectively. Based on these results, a main observation is that 
RO-III is a clear winner, as it outperforms all the other optimization 
algorithms on average for all the PC percentages examined. The lesson is that 
the average improvements of RO-III over Swap can be significant, as RO-III can 
yield up to 41% better performance than Swap on average; this difference is 
observed for $n = 80$ and PC=40%, and means that RO-III is on average 1.69 
times faster than Swap for that case. In addition, the maximum observed 
speed-up in isolated cases is much higher. For example, in one run where 
$n = 60$ and PC=60%, we have observed a speed-up of more than 73 times 
in favor of RO-III. In another run for $n = 100$ and PC=40%, the speed-up 
exceeded 285 times (two orders of magnitude). 

Figure 10: Improvements in the SCM metric for PCs=20% (top-left), for PCs=40% 
(top-right), for PCs=60% (bottom-left) and for PCs=80% (bottom-right). 

RO-I seems to outperform RO-II for 80% precedence constraints on average; 
however, if we zoom in on the isolated runs, RO-II is better in a significant 
portion of plans. For fewer precedence constraints, there is no clear 
winner between RO-I and RO-II. 

Another significant observation from this figure, combined with Figure 5, 
is that RO-III eliminates the gap between approximate and accurate 
solutions for 15-task flows. This provides strong insights into the 
near-optimality of RO-III in practice, although no real experiments are 
feasible in order to establish the ground truth for bigger flows and 
near-optimality cannot be theoretically proved (most probably), as explained 
earlier. 

Table 3: Normalized performance for data flows with 40% precedence constraints 

Uniformly Distributed Cost and Selectivity Values 
n     Initial   RO-I     RO-II    RO-III   Swap     Avg Diff   Max Diff 
20    1.0000    0.3339   0.3566   0.2841   0.4101   0.2636     0.8102 
50    1.0000    0.2696   0.2679   0.1780   0.2761   0.3281     0.9802 
80    1.0000    0.2181   0.2225   0.1420   0.2355   0.4069     0.9663 
100   1.0000    0.2149   0.2005   0.1478   0.2120   0.2900     0.9965 

Beta Distributed Cost and Selectivity Values 
n     Initial   RO-I     RO-II    RO-III   Swap     Avg Diff   Max Diff 
20    1.0000    0.3509   0.4235   0.2837   0.4035   0.2756     0.9562 
50    1.0000    0.3942   0.2287   0.1075   0.2310   0.4865     0.9898 
80    1.0000    0.1356   0.1945   0.0553   0.1403   0.6041     0.9949 
100   1.0000    0.0944   0.1591   0.0538   0.1141   0.5699     0.9995 

The experiments above refer to uniformly distributed values of costs and 
selectivities. We repeat the experiments when those values follow the beta 
distribution, which can describe selectivities, as explained in [26]. We have 
tested several parameters of that distribution, without big differences; here 
we present the results when the two main beta distribution parameters are 
set to $\alpha = \beta = 0.5$. Table 3 presents the performance 
of the RO-I, RO-II, RO-III and Swap heuristics normalized according to 
the cost of the initial randomly generated plan; the PCs are 40%. The 
last two columns of Table 3 are computed as follows over all 100 iterations: 
$AvgDiff = \frac{1}{100} \sum_{i=1}^{100} \frac{SCM_i^{Swap} - SCM_i^{RO\text{-}III}}{SCM_i^{Swap}}$ 
and 
$MaxDiff = \max_{i} \frac{SCM_i^{Swap} - SCM_i^{RO\text{-}III}}{SCM_i^{Swap}}$, 
and as such the closer the values to 1 the bigger the relative improvement 
of RO-III. 
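The two difference columns and the speed-up factors quoted in the text are related by simple arithmetic, sketched below in Python. The function names and the three per-run SCM values are made up for illustration only; the per-run definition of the relative difference and the conversion `1 / (1 - diff)` are our reading of how the reported numbers relate to each other.

```python
def avg_max_diff(swap_scm, ro3_scm):
    """Average and maximum per-run relative SCM decrease of RO-III vs. Swap."""
    diffs = [(s - r) / s for s, r in zip(swap_scm, ro3_scm)]
    return sum(diffs) / len(diffs), max(diffs)

def speedup(diff):
    """Speed-up factor implied by a relative SCM decrease."""
    return 1.0 / (1.0 - diff)

# Hypothetical SCM values of three runs (not taken from the experiments).
avg, mx = avg_max_diff([10.0, 20.0, 40.0], [5.0, 20.0, 10.0])
print(round(avg, 4), mx)            # 0.4167 0.75

# Values taken from Table 3 (uniform distribution, PCs = 40%).
print(round(speedup(0.4069), 2))    # 1.69
print(round(speedup(0.9965), 1))    # 285.7
```

The last two lines reproduce the 1.69 times average speed-up for $n = 80$ and the speed-up of more than 285 times for $n = 100$ from the Avg Diff and Max Diff entries.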

The main observation here is that for beta-distributed values, the 
performance of RO-III against Swap improves even more. In the case of flows 
that consist of 80 and 100 tasks, RO-III results in 60% and 57% less 
SCM, which implies a 2.5x and 2.32x speed-up, respectively; this reduction 
is significantly higher than the one for uniformly distributed metadata. 
Interestingly, in a specific iteration, the maximum observed decrease is by 3 
orders of magnitude. In general, especially for large flows, the performance 
improvements for beta-distributed values are higher for all techniques. 


8.1.2. Performance of Parallel Optimization Solutions 

This set of experiments is conducted in order to evaluate the performance 
of data flows when they are executed in parallel according to the techniques 
discussed in Section 6. To this end, we compare the parallel version of Swap, 
named PSwap, against the parallel versions of the proposed rank ordering-based 
algorithms, denoted as PRO-I, PRO-II and PRO-III, respectively. We also compare 
against PGreedyII, which outperforms PGreedyI as shown in additional 
experiments in Appendix E. Initially, we assume that the merge cost $mc$ is 0, 
but we relax this assumption later. 

The comparisons are presented in Table 4, where it is shown that the 
parallelized version of RO-III, PRO-III, strengthens its position as the best 
performing technique. When the merge cost is considered, the names of the 
algorithms are coupled with the prime symbol; for the moment we do not 
focus on those table rows. For linear flows, when n=50 and PCs=40%, RO-III 
results in a decrease of the SCM of Swap by 32% (see Table 3). In a parallel 
setting, the decrease in SCM comparing PSwap and PRO-III reaches 41%. 
Also, for n=100, the performance improvements reach 52% (from 29%, see 
Table 3). The relative improvements are similar for PCs=60% and slightly 
lower for PCs=20% and PCs=80%. 

Table 4: Normalized performance for data flows with n=50,100 tasks. 

n=50 
alg\PCs(%)    20       40       60       80 
Initial       1.0000   1.0000   1.0000   1.0000 
PSwap         0.1759   0.2723   0.3329   0.4987 
PSwap'        0.1804   0.2812   0.3448   0.5177 
PGreedyII     0.1052   0.1839   0.2842   0.4413 
PGreedyII'    0.1057   0.1865   0.2921   0.4552 
PRO-I         0.1340   0.2277   0.2949   0.4534 
PRO-I'        0.1363   0.2321   0.3011   0.4629 
PRO-II        0.1171   0.2418   0.4455   0.5355 
PRO-II'       0.1188   0.2497   0.4686   0.5579 
PRO-III       0.0989   0.1600   0.2156   0.4012 
PRO-III'      0.0990   0.1605   0.2166   0.4062 

n=100 
alg\PCs(%)    20       40       60       80 
Initial       1.0000   1.0000   1.0000   1.0000 
PSwap         0.0855   0.1428   0.2087   0.3440 
PSwap'        0.0886   0.1488   0.2197   0.3580 
PGreedyII     0.0485   0.0765   0.1274   0.2635 
PGreedyII'    0.0485   0.0769   0.1299   0.2719 
PRO-I         0.0793   0.1264   0.2013   0.2994 
PRO-I'        0.0820   0.1302   0.2072   0.3069 
PRO-II        0.0605   0.4507   0.2522   0.4073 
PRO-II'       0.0618   0.4911   0.2671   0.4278 
PRO-III       0.0465   0.0681   0.1058   0.2183 
PRO-III'      0.0465   0.0681   0.1063   0.2204 
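The percentages quoted for the parallel setting follow from the Table 4 entries for PCs=40% and zero merge cost. The following Python lines redo that arithmetic; the helper name `relative_decrease` is ours, and the computation uses the averaged table entries rather than per-run values.

```python
def relative_decrease(baseline, improved):
    """Relative SCM decrease of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline

# PSwap vs. PRO-III, PCs = 40%, mc = 0 (normalized SCM values from Table 4).
n50 = relative_decrease(0.2723, 0.1600)
n100 = relative_decrease(0.1428, 0.0681)
print(round(n50, 2), round(n100, 2))  # 0.41 0.52
```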

A question arises as to how often parallelization leads to benefits. 
Analyzing the individual runs, we have observed that the number of such 
occurrences is less than 10%, if we count only improvements higher than 2%. 
Nevertheless, the magnitude of the improvements is strongly correlated with 
the number of PCs. For less constrained settings (PCs = 20%), for both n=50 
and n=100, we have observed speed-ups of an order of magnitude. When 
PCs=40%, the maximum observed speed-up drops to 4 and 3 times, 
respectively. For even more PCs, this speed-up does not exceed 12.7%. A final 
note is that PGreedyII is the best performing parallel heuristic among those 
not fully proposed in this work. The main conclusion up to here is that further 
refining the linear orderings with our proposed light-weight post-processing 
step can yield tangible performance improvements, and our proposals lead to 
further advancements in the current state-of-the-art in linear flow 
optimization. 

Figure 11: MIMO optimization for n=100,200 and 40% precedence constraints 

Next, we repeat the experiments with non-zero merge cost, and the results 
verify that its impact is negligible (see Table 4). After real experiments with 
the PDI tool, we set $mc = 10$, which is an order of magnitude higher than 
the least expensive tasks and an order of magnitude lower than the most 
expensive ones. Overall, our best performing solution, namely 
PRO-III, continues to have average performance improvements against Swap 
of an order of magnitude. 

8.1.3. Performance of MIMO flows 

This set of experiments considers the evaluation of the methodology that 
is analyzed in Section [7] for MIMO data flow optimization. We consider two 
cases of butterfly flows (see Figure [9] (left)). In each case we consider 10 linear
segments with 10 and 20 tasks, respectively; thus the overall number of tasks 
is 100 and 200. The percentage of PCs is 40%. 

Figure [11] presents the average performance improvements of the PRO-III
and Swap algorithms over the non-optimized initial data flow. In the case
where the linear segments are very small (10 tasks) the improvements are
small as well. When the linear segment size increases to 20, PRO-III has
34% better performance improvement than Swap, and 74% lower execution
cost compared to the non-optimized case. The performance improvements
are commensurate with those in Table [3], which supports our claim that our
proposals for SISO flows can be transferred to MIMO settings as well.

Figure 12: Optimization overhead for DP and TopSort with 50% precedence constraints
and n = 15,...,20 (top-left), for TopSort when n = 10,...,70 and PCs=98% (top-right), and
n = 15,20 and PCs vary (bottom-left), and for Backtracking and TopSort when n = 15
with different range of precedence constraints (bottom-right).


8.2. Time Overhead 

In this section, we conduct a thorough evaluation of the time overhead
of the accurate optimization algorithms. The purpose of this set of experiments
is to show that the application of the exhaustive algorithms, and more
specifically of TopSort, is limited only to small or very constrained flows.
Figure [12] (top-left) presents the average execution time of the DP algorithm
compared to the TopSort solution for 50% precedence constraints. More
specifically, this figure depicts the time overhead for optimizing data flows
with n = 15, ..., 20 flow tasks. The main conclusion that can be drawn from
this figure is that the DP algorithm is not a practical optimization solution even
for small flows that consist of 19 flow activities; the execution of a flow with
20 tasks requires over 3 days using our test machine. Even though the TopSort
algorithm runs at least 50 times faster than DP, the execution time of TopSort
follows a pattern similar to that of DP.

Figure [12] (top-right) shows the average execution time of TopSort for
flows with n = 10, ..., 70 having 98% precedence constraints, which implies
that the number of the possible re-orderings is quite restricted. TopSort
does not scale well, but can run in acceptable time even for medium-sized
flows of 60 tasks. Additionally, Figure [12] (bottom-left) shows that TopSort
cannot scale for arbitrary precedence constraints even for flows with 15 and
20 flow activities. For example, the execution time of a data flow with 20
tasks and 50% precedence constraints is 2 orders of magnitude higher than
the execution time of a data flow with 15 tasks. Finally, in the bottom-right
part of Figure [12], the time overhead of Backtracking compared to TopSort
is presented. The main observation from this figure, where the precedence
constraints range is PCs = 90%, ..., 98%, is that Backtracking can be up to
62 times slower than TopSort.
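
To make the effect of the constraint density more tangible, the following sketch counts how many orderings remain valid for a small flow. It is only an illustration under our own assumptions: the function name `count_valid_orderings`, the 0-based task ids and the two example constraint sets are hypothetical and are not part of the evaluated implementation.

```python
def count_valid_orderings(n, constraints):
    """Count the orderings of n tasks that respect all precedence constraints.

    constraints holds (a, b) pairs of 0-based task ids, meaning that task a
    must precede task b; the constraint graph is assumed to be acyclic.
    """
    preds = [0] * n
    for a, b in constraints:
        preds[b] |= 1 << a

    # ways[mask] = number of valid orderings of exactly the tasks in mask
    ways = [0] * (1 << n)
    ways[0] = 1
    for mask in range(1 << n):
        if ways[mask] == 0:
            continue
        for t in range(n):
            bit = 1 << t
            if mask & bit:
                continue  # task t is already placed
            if preds[t] & ~mask:
                continue  # some prerequisite of t is not placed yet
            ways[mask | bit] += ways[mask]
    return ways[(1 << n) - 1]


# Hypothetical inputs: 10 tasks without constraints vs. a fully ordered chain.
unconstrained = count_valid_orderings(10, [])
chain = count_valid_orderings(10, [(i, i + 1) for i in range(9)])
```

An exhaustive technique has to visit every one of these orderings, so its running time follows this count rather than n alone.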

Overall, we can conclude that TopSort, on the one hand scales better 
than the other techniques and is applicable in specific cases where the other
two approaches are not, but, on the other, it is not able to scale in general. 
We do not present the overhead of the approximate solutions, because it is 
negligible. 

9. Related Work 

The existing approaches to flow optimization can be classified in the following
main categories, which are subsequently presented in turn:

• Optimization of the structure of data flows: this category targets the
methodologies that optimize the flow execution plan through changes
in the structure of the flow graph including task re-ordering.

• Optimization of the resource allocation and scheduling aspects of data
flows: the proposals in this category deal with issues such as the allocation
of computational resources and specific execution engines to each
part of the flow along with time scheduling details, without affecting
the workflow structure.

• Application-dependent solutions: this category contains optimization
techniques that are specific to certain settings; interestingly, some of
these techniques leverage database technologies.


Optimization of the structure of data flows. An aspect of this category 
that is particularly relevant to our work considers flow optimization inspired 
by query processing techniques. In [^, an optimization algorithm for query 
plans with dependency constraints between algebraic operators is presented. 
The adaptation of this algorithm in our SISO problem setting that does 
not consider only algebraic operators is reduced to the existing optimization 
algorithms we have presented in previous sections, and more specifically to
GreedyI and Partition. In j^, ad-hoc query optimization methodologies are
employed in order to perform structure reformations, such as reordering and 
introducing new services in an existing workflow; in this work we investigate 
more systematic approaches. 

Optimizations of Extract Transform Loading (ETL) flows are analyzed
in several works, including [10]. Specifically, the authors consider ETL
execution plans as states and
use transitions, such as swap, merge, split, distribute and so on, to generate
new states in order to navigate through the state space, which corresponds to
the execution plan alternatives; they also present optimization algorithms for
reducing ETL workflow execution cost, albeit with exponential complexity.
In our work, where we consider only task re-orderings, the proposal in [10]
corresponds to the Swap algorithm, which we have presented and evaluated.

Another interesting approach to flow optimization is presented in j^,
where the optimizations are based on the analysis of the properties of user-defined
functions that implement the data processing logic. This work focuses
mostly on techniques that infer the dependency constraints between
tasks through examination of their internal semantics rather than on task
reordering algorithms per se. In [29], the authors introduce a suite of quality
metrics (QoX) without going into flow optimization algorithm details.

In addition, there is a significant portion of proposals on flow optimization
that proceed to flow structure optimizations but do not perform task
reordering, as we do. For example, an interesting proposal that aims to combine
the control and the data flow view of workflows has appeared in [30].
That work presents approaches that merge tasks related to data management
to decrease the number of invocations to the underlying databases without
changing the relative order of the tasks. In [^, a data oriented method
for workflow optimization is proposed in order to minimize execution cost.
This method is based on the fact that data may be shared across several
functions, and, as such, workflow performance stands to benefit from optimizations
in the form of incorporating a shared database to handle common
data-oriented tasks. Another workflow optimization method that affects the
workflow structure with a view to improving the efficiency of the workflow
is presented in [32]. This method is inspired by the current limitations of
business information processes. In particular, a task redesigning method is
presented, which is based on the consolidation of the tasks to reduce the overall
execution time. Quality of Service requirements (QoS), such as precedence
of information flows and technology support costs, are taken into account. In
[33], a methodology to choose the optimal physical implementation of each
task and decide whether to introduce special sorting tasks is presented, when
there are several implementation alternatives. This work does not consider
the execution order of the flow activities. Several optimizations in workflows
are also discussed in j^, but the techniques are limited to straightforward
application of query optimization techniques, such as join reordering and
pushing down selections.

Optimization of the resource allocation and scheduling aspects of data
flows. The main motivation of the proposals in this category stems from the
need for more efficient resource management, given that resource management
is deemed a key performance factor. Contrary to our work, they
assume an execution setting with multiple execution engines and do not deal
with optimization of the flow task ordering. For example, in [35, 36], the authors
introduce resource allocation algorithms and heuristic techniques that have
the ability to take into account constraints, such as cost optimization, user-specified
deadline and workflow partitioning according to assigned deadlines
3. 0 discusses methodologies about how to execute and dispatch task
activities in parallel computers.

Another family of optimization proposals deals with task scheduling methods,
considering aspects such as semantic expression of workflow tasks, dynamic
selection of services among many candidates and latency minimization
0 S, 013 . Also, there are scheduling methods which are exclusively related
to grid workflow optimization (e.g., |^, [38], |^), or linear workflow
optimization, such as which discusses optimal time schedules given a
fixed allocation of activities to engines. Also, a set of optimization algorithms
based on deadline and time constraints was analyzed for scheduling
flows in [42]. Another proposal of flow optimization is presented in
based on soft deadline rescheduling in order to deal with the problem of
fault tolerance in flow executions. In ji^, an optimization methodology for
minimizing the performance fluctuations that might be caused by resource
diversity, which also considers deadlines, is proposed. Additionally, there is
a set of optimization methodologies based on multi-objective optimization.
For example, an auction-based scheduling methodology for multi-objective
flow optimization is presented in jd^, while (d^. d^ propose optimization
methodologies for multi-engine environments meeting multiple objectives,
such as fault-tolerance and performance. The implementation of some of the
optimization methods mentioned above is carried out with the help
of algorithms that take into consideration certain quality of service requirements
(QoS). In this case, users are responsible for setting constraints, such as
reliability, time, security, cost and fidelity, which are the principal parameters
of workflow task scheduling. In this work, we do not consider resource
allocation and scheduling issues, which are orthogonal to task ordering.

Application-dependent solutions. An important part of workflow optimization
research originated from optimization methods that have been
created for specific applications and, as such, they are application dependent.
An example of application dependent workflow optimization is discussed
in jd^, which deals with the creation and processing of technical documents
by a document workflow management system; in this work, the parallelism
opportunities presented by the document structure are exploited to
optimize workflows. Another example is where a process execution
management framework is proposed in order to optimize business objectives
of processes in a dynamic business environment. Also, there are workflow
optimization methodologies applied in other scientific fields. A representative
example is j^, where the optimization algorithms are used for the
development of molecular models and they are applied to a simulation tool.
Analogous examples that achieve workflow optimization only under certain
circumstances are presented in [51, 52, 53, 54]. However, these optimization
methods cannot be adapted to a more general case.


10. Conclusions 

In this work, we deal with the problem of specifying the optimal execution
order of the constituent tasks of a data flow in order to minimize the sum
of the task execution costs. We are motivated by the significant limitations
of fully-automated optimization solutions for data flows, as, nowadays, the
optimization of complex data flows is left to the flow designers and is
a manual procedure. Firstly, as the query optimization techniques are not
applicable to data flow optimization because of the precedence constraints
and the existing proposals for optimal solutions cannot scale, there is a significant
need to propose new flow optimization methodologies. We show that
the state-of-the-art optimization algorithms can have 74% higher execution
cost than the optimal solution even for the simplest type of single-input
single-output (SISO) flows with a small number of tasks. So, to fill the gap
of near-optimal optimization techniques, we propose a set of approximate
algorithms that can exhibit 40% performance improvements over the best
existing heuristic. We also introduce a post-process optimization phase for
parallel execution of the flow tasks in order to improve even more the performance
of a data flow, and we show that we can extend these solutions to
more complex data flow scenarios that deal with an arbitrary number of sources
and sinks. This work aims to provide the basis for more holistic flow optimization
algorithms, which do not only consider more complex flows, but
also combine task ordering with aspects such as task implementation and
scheduling.

Algorithm 5 Dynamic Programming
Require: A set of n tasks, T = {t_1, ..., t_n}
         A directed acyclic graph PC with precedence constraints
Ensure: A directed acyclic graph P representing the optimal plan
    {Initialize PartialPlan, Costs and Sel of size 2^n - 1}
 1: for all i ∈ {1, ..., n} do
 2:   PartialPlan[2^(i-1)] = t_i; Costs[2^(i-1)] = c_i; Sel[2^(i-1)] = sel_i;
 3: end for
 4: for all s ∈ {2, ..., n} do
 5:   R ← Subsets(T, s) {R is a set with all subsets of T of size s}
      {r is a specific subset of size s}
 6:   tempBest ← ∞
 7:   for each r ∈ R do
 8:     for all i ∈ {1, ..., r.length()} do
 9:       tempSet ← r - r(i)
10:       pos1 ← findIndex(tempSet)
11:       pos2 ← findIndex(r(i))
12:       if r(i) has all predecessors in tempSet then
13:         TempPlan ← tempSet, r(i)
14:         costTempPlan ← Costs[pos1] + Sel[pos1] Costs[pos2]
15:         if costTempPlan < tempBest then
16:           tempBest ← costTempPlan
17:           k ← pos1 + pos2
18:           update(PartialPlan[k], Costs[k], Sel[k])
19:         end if
20:       end if
21:     end for
22:   end for
23: end for
24: P ← PartialPlan[2^n - 1]


Appendix A. Extra material about the DP algorithm 

In order to implement the algorithm, we use three vectors of size 2^n - 1,
namely PartialPlan, Costs and Sel. According to the algorithm implementation,
the i-th cell corresponds to the combination of tasks for which the
bit is 1 in the binary representation of i, where the j-th task is mapped to
the j-th least significant bit. For example, if i = 13, then the binary
representation of this position is (1101)_2. Specifically, this means that
partialPlan[13] corresponds to the optimal ordering of the 1st, 3rd and 4th
tasks. The Costs and Sel vectors hold the aggregate cost and selectivity of
the subplans, respectively. The last cell of PartialPlan and Costs contain the
optimal plan and its total cost, respectively. A complete pseudocode is shown
in Algorithm [5]. For the sake of simplicity of presentation, the algorithm is
not fully optimized; e.g., in line 18, the update of the vectors may occur only
once after the final best plan is found.

Algorithm input: tasks of data flow, n = 5

Task          1     2     3     4     5
Cost          5     10    15    12    25
Selectivity   0.5   0.2   0.3   0.2   0.5

Figure A.13: Metadata of the example data flow

DP algorithm: example computations

* For subsets of size 2, we examine the subset in R(1,:) {4 5}:
    4-5 → valid partial plan (no precedence constraints)
    5-4 → not valid partial plan (precedence constraints)
  1) Find the position of subsets {4} and {5}:
     - pos(4) = 2^(4-1) = 2^3 = 8
     - pos(5) = 2^(5-1) = 2^4 = 16
     - pos(4-5) = pos(4) + pos(5) = 8 + 16 = 24
  2) Cost and selectivity estimation of partial order 4-5:
     - Costs(24,1) = Costs(8,1) + Costs(8,2) * Costs(16,1) = 12 + 5 = 17
     - Sel(24,1) = Sel(8,1) * Sel(16,1) = 0.2 * 0.5 = 0.1

* For subsets of size 3, we examine the subset in R(2,:) {2 4 5}:
    2-4-5 → valid partial plan (no precedence constraints)
    4-5-2 → not valid partial plan (precedence constraints)
  1) Find the position of subsets {2} and {4 5}:
     - pos(4-5) = 2^(4-1) + 2^(5-1) = 24
     - pos(2) = 2^(2-1) = 2
     - pos(2-4-5) = pos(4-5) + pos(2) = 24 + 2 = 26
  2) Cost and selectivity estimation of partial order 2-4-5:
     - Costs(26,1) = Costs(2,1) + Costs(2,2) * Costs(24,1) = 13.4
     - Sel(26,1) = Sel(2,1) * Sel(24,1) = 0.02

Figure A.14: Example of the DP algorithm.
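
The indexing scheme described above can be illustrated with a few lines of code. This is a minimal sketch, assuming 1-based task ids as in the example computations; the helper names `subset_index` and `tasks_of` are ours and do not appear in the original implementation.

```python
def subset_index(tasks):
    """Position of a set of 1-based task ids in the DP vectors.

    Task j contributes 2^(j-1), so every subset maps to a distinct cell.
    """
    return sum(1 << (t - 1) for t in set(tasks))


def tasks_of(index):
    """Inverse mapping: the task ids whose bit is set in the given position."""
    return [j + 1 for j in range(index.bit_length()) if (index >> j) & 1]


pos_45 = subset_index({4, 5})        # 8 + 16
pos_245 = subset_index({2, 4, 5})    # 2 + 8 + 16
decoded = tasks_of(13)               # bits of (1101)_2, least significant first
```

Because the position of a union of disjoint subsets is the sum of their positions, line 17 of the pseudocode can obtain the cell of the extended subset with a single addition.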

We give an example of the algorithm with a flow with n = 5; the task
metadata are shown in Figure [A.13]. The DP example is in Figure [A.14].
First of all, all the subsets R of T of length k = 1, 2, ..., n are found. For
single task subsets, such as {t_1}, {t_2}, ..., {t_n}, DP estimates their position in
the partialPlan matrix, e.g., the {2} subset is positioned in partialPlan(2^(2-1), 1).
For subsets with length greater than 1, e.g., the subset {1, 3, 4}, we examine
the case that each element of that subset is placed at the end of the subset. If
the precedence constraints are violated, DP continues to the next placement.
If the precedence constraints are not violated, the algorithm estimates the cost
of the valid partial plan with that element positioned at the end of the subset,
reusing the results of the orderings of smaller subsets. Similarly, the cost of
all orderings in the subset is estimated and the algorithm finds the ordering
of the subset with the minimum cost. The optimal partial plan, its cost and
the product of task selectivities are stored in the corresponding position in
the partialPlan and DPcs vectors, respectively. For example, the partial
plan {1, 3, 4, 5} is stored in position 2^(1-1) + 2^(3-1) + 2^(4-1) + 2^(5-1) = 29 of the
partialPlan matrix.
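
The procedure can be sketched compactly as follows. This is not the authors' implementation: it visits the subsets in increasing numeric order of their position instead of by subset size, and the function name `dp_optimal_plan` as well as the precedence constraints used in the example are assumptions made for illustration (the cost and selectivity values are those of Figure A.13).

```python
from math import inf


def dp_optimal_plan(costs, sels, constraints):
    """Bitmask DP over task subsets; constraints are (a, b) pairs of 1-based
    task ids meaning that a must precede b."""
    n = len(costs)
    preds = [0] * n
    for a, b in constraints:
        preds[b - 1] |= 1 << (a - 1)

    size = 1 << n
    best_cost = [inf] * size    # cost of the best valid ordering per subset
    best_plan = [None] * size   # the ordering itself (None: no valid one)
    sel_prod = [1.0] * size     # product of selectivities, order-independent
    best_cost[0], best_plan[0] = 0.0, ()

    for mask in range(1, size):
        low = mask & -mask
        sel_prod[mask] = sel_prod[mask ^ low] * sels[low.bit_length() - 1]
        for last in range(n):   # try each member as the final task
            bit = 1 << last
            if not mask & bit:
                continue
            rest = mask ^ bit
            if best_plan[rest] is None or preds[last] & ~rest:
                continue        # rest is infeasible or misses a prerequisite
            cand = best_cost[rest] + sel_prod[rest] * costs[last]
            if cand < best_cost[mask]:
                best_cost[mask] = cand
                best_plan[mask] = best_plan[rest] + (last + 1,)
    return best_plan[size - 1], best_cost[size - 1]


costs = [5, 10, 15, 12, 25]
sels = [0.5, 0.2, 0.3, 0.2, 0.5]
# Assumed constraints for illustration only.
constraints = [(1, 2), (1, 3), (2, 4), (4, 5)]
plan, cost = dp_optimal_plan(costs, sels, constraints)
```

Under these assumed constraints the sketch returns the ordering 1-2-4-3-5 with a cost of about 11.65. Both the memory and the running time grow with 2^n, which is in line with the overhead reported in Section 8.2.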

Correctness. If PartialPlan is of size n = 1, the optimal solution is trivial
and is found by the algorithm during initialization in lines 1-3 of Algorithm
[5]. We assume that a PartialPlan of size n - 1 is optimal and we need to
prove that PartialPlan of size n is also optimal. The sketch of the proof is
based on contradiction. Let us assume that the DP does not produce the
optimal solution. Any linear solution of size n consists of a PartialPlan of
size n - 1 followed by the n-th task; DP checks all the alternatives for the
n-th task. So, there is a different optimal solution, where the PartialPlan
of size n - 1 is different from DP's PartialPlan of the same size. According
to the SCM, the cost of the subplan of size n is computed as the sum of
two components: the cost of the subplan of size n - 1 and the cost of the n-th
task times the selectivity of the first n - 1 tasks. The costs of the solutions
of size n, which end with the same task, differ only in the first component.
According to our assumptions, the cost of DP's PartialPlan of size n - 1
cannot be higher than any other subplan solution of size n - 1 by definition.
Consequently, there is no other solution different from DP's solution that
can yield lower cost. This completes the proof.
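
The cost decomposition on which the argument relies can be written out explicitly. The notation below is ours and only restates the two components mentioned above, with S denoting an ordering of n - 1 tasks, t the task appended last, and c_t and sel_j the task cost and selectivity:

```latex
\[
\mathrm{SCM}(S \cdot t) \;=\; \mathrm{SCM}(S) \;+\; \Bigl(\prod_{t_j \in S} sel_j\Bigr)\, c_t
\]
```

The product depends only on which tasks belong to S and not on their order. Hence, for a fixed last task t, minimizing SCM(S · t) is equivalent to minimizing SCM(S) over the valid orderings of the same n - 1 tasks.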

Algorithm 6 TopSort
Require: A set of n tasks, T = {t_1, ..., t_n} with known costs and selectivities.
         A directed acyclic graph PC with precedence constraints.
Ensure: An ordering of the tasks P representing the optimal plan.
 1: G = {t_1, t_2, ..., t_n} {G is initialized with a valid topological ordering
    of PC.}
 2: i = 1
 3: minCost ← computeSCM(G)
 4: while i < n {n is the total number of tasks} do
 5:   k ← location(l, i)
 6:   k1 ← k + 1
 7:   if G(k1) task has prerequisite i then
 8:     // Rotation stage
 9:     Rotate the elements of G from positions i to k
10:     cost ← computeSCM(G)
11:     i ← i + 1
12:   else
13:     // Swapping stage
14:     Swap the k and k1 elements of G
15:     cost ← computeSCM(G)
16:     i ← 1
17:   end if
18:   if cost < minCost then
19:     P ← G
20:     minCost = cost
21:   end if
22: end while


Appendix B. Extra material about the TopSort algorithm 

The algorithm's pseudocode is presented in Algorithm [6]. The algorithm
exhaustively checks all the permutations that satisfy the precedence constraints,
and as such, it always finds the optimal solution for linear flows.
The computeSCM function needs to be constructed in a way that does not
compute the cost of each ordering from scratch, which is too naive, but leverages
the computations of the previous plans taking into account the local
changes in the new plan. In Figure [B.15], an example of finding the optimal
plan of a flow using TopSort is presented. In this example, the running steps
of the TopSort algorithm are depicted, given as input a valid flow execution plan
(Initial plan order label) and assuming the metadata of Figure [A.13].
Each of the given plans describes a plan generated after either a rotation or a
swap action. The optimal flow execution plan is the one labeled Final plan
order.

TopSort algorithm

Initial plan order: P = 1 2 3 4 5, SCM(P) = 12.01
CP(1,2)=1 → i=1, k=1 → rotate P(i:k); SCM(P) = 12.01
CP(2,2)=0 → k=2, k1=3 → swap P(k) with P(k1); SCM(P) = 14.51
CP(1,3)=1 → i=1, k=1 → rotate P(i:k); SCM(P) = 14.51
CP(2,4)=1 → i=2, k=3 → rotate P(i:k); SCM(P) = 12.01
CP(3,3)=0 → k=3, k1=4 → swap P(k) with P(k1); SCM(P) = 11.65
CP(1,2)=1 → i=1, k=1 → rotate P(i:k); SCM(P) = 11.65
CP(2,4)=1 → i=2, k=2 → rotate P(i:k); SCM(P) = 11.65
CP(3,5)=1 → i=3, k=4 → rotate P(i:k); SCM(P) = 12.01
CP(4,5)=1 → i=4, k=4 → rotate P(i:k); SCM(P) = 12.01
Final plan order: SCM(P) = 11.65

Figure B.15: Example of TopSort algorithm.
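
For comparison, exhaustive search over all valid orderings can also be sketched with plain backtracking. Note that this is not the rotation and swap scheme of Algorithm 6; it only shows the same search space and how the cost of a prefix can be carried along instead of recomputing each ordering from scratch. The function name `best_valid_ordering` and the precedence constraints of the example are assumptions made for illustration.

```python
from math import inf


def best_valid_ordering(costs, sels, preds):
    """Enumerate every ordering that respects preds and keep the cheapest.

    costs and sels map task ids to values; preds maps a task id to the set
    of tasks that must precede it.
    """
    tasks = list(costs)
    best = {"cost": inf, "plan": None}
    plan, placed = [], set()

    def extend(prefix_cost, prefix_sel):
        if len(plan) == len(tasks):
            if prefix_cost < best["cost"]:
                best["cost"], best["plan"] = prefix_cost, list(plan)
            return
        for t in tasks:
            if t in placed or not preds.get(t, set()) <= placed:
                continue
            plan.append(t)
            placed.add(t)
            # Only the appended task changes the cost of the prefix.
            extend(prefix_cost + prefix_sel * costs[t], prefix_sel * sels[t])
            plan.pop()
            placed.remove(t)

    extend(0.0, 1.0)
    return best["plan"], best["cost"]


costs = {1: 5, 2: 10, 3: 15, 4: 12, 5: 25}
sels = {1: 0.5, 2: 0.2, 3: 0.3, 4: 0.2, 5: 0.5}
# Assumed constraints for illustration only.
preds = {2: {1}, 3: {1}, 4: {2}, 5: {4}}
plan, cost = best_valid_ordering(costs, sels, preds)
```

With these inputs the cheapest ordering found is 1-2-4-3-5, whose cost of about 11.65 agrees with the final SCM value listed for Figure B.15.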

Note that we can implement TopSort in a different way, where the tasks
are checked from right to left. Although in [19] this flavour is claimed to
be capable of yielding better performance, this has not been verified in our
flows.


Appendix C. Extra material about the existing approximate algorithms

Here we present the pseudocode for the Swap, GreedyI and Partition
algorithms (Algorithms [7], [8] and [10], respectively). Figures [C.16], [C.17] and [C.18]
present examples for the input in Figure [A.13].

Algorithm 7 Swap
Require: A set of n tasks, T = {t_1, ..., t_n}
         A directed acyclic graph PC with precedence constraints
Ensure: A directed acyclic graph P representing the optimal plan
 1: P ← randomValidPlan(PC) {Initialize P}
 2: swapping ← true
 3: while (swapping == true) do
 4:   swapping ← false
 5:   for all tasks t_i ∈ T do
 6:     if t_{i+1} does not have t_i as prerequisite then
 7:       if (computeSCM(t_{i+1} → t_i) < computeSCM(t_i → t_{i+1})) then
 8:         swap t_i and t_{i+1} in P
 9:         swapping ← true
10:       end if
11:     end if
12:   end for
13: end while

Figure C.16: Example of Swap algorithm.
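
The Swap heuristic can be sketched in a few lines. The sketch assumes that the cost of the two adjacent tasks suffices for the comparison, since the tasks before and after the pair are unaffected by the swap; the function names, the initial plan and the precedence constraints are hypothetical choices made for illustration.

```python
def scm(plan, costs, sels):
    """Sum cost metric of a linear plan: each task pays its cost per input."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * costs[t]
        inp *= sels[t]
    return total


def swap_heuristic(plan, costs, sels, preds):
    """Repeatedly swap adjacent tasks when the swap is allowed and cheaper."""
    plan = list(plan)
    swapping = True
    while swapping:
        swapping = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if a in preds.get(b, set()):
                continue  # b requires a, so the pair cannot be swapped
            # Compare the local cost of the order b, a against a, b.
            if costs[b] + sels[b] * costs[a] < costs[a] + sels[a] * costs[b]:
                plan[i], plan[i + 1] = b, a
                swapping = True
    return plan


costs = {1: 5, 2: 10, 3: 15, 4: 12, 5: 25}
sels = {1: 0.5, 2: 0.2, 3: 0.3, 4: 0.2, 5: 0.5}
# Assumed constraints and initial plan, for illustration only.
preds = {2: {1}, 3: {1}, 4: {2}, 5: {4}}
improved = swap_heuristic([1, 3, 2, 4, 5], costs, sels, preds)
```

Starting from the valid plan 1-3-2-4-5, the sketch ends with 1-2-4-3-5. The result of Swap depends on the initial plan and, in general, it is only a local optimum.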

Algorithm 8 GreedyI
Require: A set of n tasks, T = {t_1, ..., t_n}
         A directed acyclic graph PC with precedence constraints
Ensure: A directed acyclic graph P representing the optimal plan
 1: P ← ∅
 2: Cand ← ∅ {Cand holds the candidate tasks}
 3: C ← ∅ {C holds the considered tasks already in P}
 4: updateCandidates(Cand, PC, C, T)
 5: while list Cand is not empty do
 6:   for all tasks t_j in Cand do
 7:     Find task t_j with maximum cost where cost = (1 - sel_j) / cost_j
 8:   end for
 9:   Add t_j task to optimal plan P
10:   C ← C ∪ t_j
11:   updateCandidates(Cand, PC, C, T)
12: end while


Algorithm 9 Function updateCandidates
 1: updateCandidates(Cand, PC, C, T)
 2: for all tasks t_i in T do
 3:   if task t_i ∉ C then
 4:     if task t_i has no prerequisites then
 5:       Add task t_i to list Cand
 6:     else
 7:       if all of the prerequisites ∈ C then
 8:         Add task t_i to list Cand
 9:       end if
10:     end if
11:   end if
12: end for
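
Algorithms 8 and 9 can be combined into the short sketch below. The candidate list is recomputed in every step, which plays the role of updateCandidates. The function name `greedy_plan` and the 4-task flow are hypothetical; the values are made up for illustration and are not taken from the paper.

```python
def greedy_plan(costs, sels, preds):
    """Append, in each step, the eligible task with the highest
    (1 - sel) / cost value."""
    plan, done = [], set()
    while len(plan) < len(costs):
        # A task is a candidate once all its prerequisites are in the plan.
        cand = [t for t in costs
                if t not in done and preds.get(t, set()) <= done]
        if not cand:
            raise ValueError("precedence constraints contain a cycle")
        nxt = max(cand, key=lambda t: (1 - sels[t]) / costs[t])
        plan.append(nxt)
        done.add(nxt)
    return plan


# Hypothetical flow: task 3 requires task 1, task 4 requires task 2.
costs = {1: 4, 2: 2, 3: 6, 4: 1}
sels = {1: 0.5, 2: 0.9, 3: 0.1, 4: 1.0}
preds = {3: {1}, 4: {2}}
plan = greedy_plan(costs, sels, preds)
```

In this example task 1 is chosen first, which makes the highly selective task 3 eligible before task 2 is placed.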


[Figure panel: Greedy I algorithm. Final plan: SCM(P) = 14.51]

Figure C.17: Example of Greedy algorithm.

[Figure panel: Partition algorithm. Final plan: SCM(P) = 12.01]

Figure C.18: Example of Partition algorithm.


Appendix D. Extra material about the rank ordering-based techniques

In this section, an illustrative example of the rank ordering methodologies
is presented. Figure [D.19] depicts metadata details for a data flow with
10 tasks, which are used as input for the application of the RO-I, RO-II and
RO-III algorithms. Specifically, this figure shows the PC graph, the values
of selectivity and cost, but also the rank values that correspond to each
task of the flow. We should mention that the sink node of the data flow
is disconnected from the flow in the precedence constraint graph, as it is
assumed that all the flow tasks must precede this task, and we connect it
after the optimization procedure is finished. The detailed examples of the
rank ordering proposals are described at length in the following.

In Figure [D.20], we present the pre-processing phase of RO-I, which
transforms the precedence constraint graph into a tree-shaped graph. The
graph of the figure shows the final result of the dependency constraint graph.
Then, we apply the KBZ algorithm, which is depicted in Figure [D.21].
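
The rank value and the merging of two tasks into a composite one, as used by the rank ordering techniques, can be sketched as follows. The class and function names, as well as the numeric values, are hypothetical and only illustrate the formulas rank = (1 - sel) / cost, cost_ab = cost_a + sel_a * cost_b and sel_ab = sel_a * sel_b.

```python
from typing import NamedTuple


class Task(NamedTuple):
    cost: float
    sel: float


def rank(t: Task) -> float:
    """Rank of a (possibly composite) task."""
    return (1 - t.sel) / t.cost


def merge(a: Task, b: Task) -> Task:
    """Composite task for the fixed sequence a followed by b."""
    return Task(cost=a.cost + a.sel * b.cost, sel=a.sel * b.sel)


# Hypothetical values: a must precede b although b has the higher rank,
# so the two tasks are merged and treated as a single unit afterwards.
a, b = Task(cost=10, sel=0.5), Task(cost=20, sel=0.4)
ab = merge(a, b)
```

Here rank(a) is 0.05 and rank(b) is 0.03, while the composite task has cost 20 and selectivity 0.2, that is, a rank of 0.04.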

Algorithm 10 Partition
Require: A set of n tasks, T = {t_1, ..., t_n}
         A directed acyclic graph PC with precedence constraints
Ensure: A directed acyclic graph P representing the optimal plan
 1: P ← ∅
 2: Cand ← ∅ {Cand holds the candidate tasks}
 3: C ← ∅ {C holds the considered tasks already in P}
 4: updateCandidates(Cand, PC, C, T)
 5: while (Cand != ∅) do
 6:   Estimate all possible permutations of the tasks t_i ∈ Cand
 7:   tempBestCost ← ∞
 8:   tempBestPlan ← ∅
 9:   for each possible permutation perm do
10:     costPerm ← computeSCM(perm)
11:     if (costPerm < tempBestCost) then
12:       tempBestCost ← costPerm
13:       tempBestPlan ← perm
14:     end if
15:   end for
16:   Append tempBestPlan to P
17:   C ← C ∪ Cand
18:   updateCandidates(Cand, PC, C, T)
19: end while
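
A minimal sketch of the Partition heuristic is given below. It assumes that comparing the permutations of a cluster by their local SCM is sufficient, because the selectivity of the tasks already in the plan scales all permutations of the cluster equally. The function names and the precedence constraints are assumptions made for illustration; the cost and selectivity values are those of Figure A.13.

```python
from itertools import permutations


def scm(plan, costs, sels):
    """Sum cost metric of a linear plan."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * costs[t]
        inp *= sels[t]
    return total


def partition_plan(costs, sels, preds):
    """Cluster the tasks by eligibility and order each cluster exhaustively."""
    plan, done = [], set()
    while len(done) < len(costs):
        cand = [t for t in costs
                if t not in done and preds.get(t, set()) <= done]
        if not cand:
            raise ValueError("precedence constraints contain a cycle")
        # Try every permutation of the current cluster and keep the cheapest.
        best = min(permutations(cand), key=lambda p: scm(p, costs, sels))
        plan.extend(best)
        done.update(cand)
    return plan


costs = {1: 5, 2: 10, 3: 15, 4: 12, 5: 25}
sels = {1: 0.5, 2: 0.2, 3: 0.3, 4: 0.2, 5: 0.5}
# Assumed constraints for illustration only.
preds = {2: {1}, 3: {1}, 4: {2}, 5: {4}}
plan = partition_plan(costs, sels, preds)
```

With these inputs the clusters are {1}, {2, 3}, {4} and {5}, and the resulting plan 1-2-3-4-5 has a cost of about 12.01, the value shown in Figure C.18. The cost of the exhaustive step grows factorially with the cluster size.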



Figure D.19: The precedence constraint graph (PC), cost, selectivity, rank values of a data
flow with 10 activities and the total execution cost.

[Figure panel: Acyclic PC - Step 1]

Figure D.20: The pre-processing phase of RO-I to ensure that there are not cycles in the
PC graph.

In the following, the validity post-process phase of RO-I is analyzed in
Figure [D.22]; this phase ensures that the optimized execution flow plan does not violate the
dependency constraints. Finally, as is shown in this figure, the cost of the
optimized execution plan is 237.0844.

Figure [D.23] illustrates in detail the steps of the application of RO-II.
Steps 1-3 describe the pre-processing phase of RO-II, where we merge
two sub-segments into a linear sub-flow, because they create cycles by sharing
the same intermediate source and sink. The cost of the optimized flow plan
returned by the RO-II methodology is 317.3132.

In Figure [D.24], the result of the post-processing phase of algorithm RO-III
is described. In this phase, the optimized flow plan is obtained by moving
the flow task R to a later stage. The optimized cost of the flow execution is
205.5607.

PC - Step 2: KBZ application

• merge t_4 with t_5: cost_45 = cost_4 + sel_4 * cost_5 = 205.8;
  rank_45 = (1 - sel_45) / cost_45 = -0.0051
• merge t_45 with t_8: cost_458 = cost_45 + sel_45 * cost_8 = 309.84;
  sel_458 = sel_45 * sel_8; rank_458 = (1 - sel_458) / cost_458 = 0.0026
• merge t_2 with t_458: cost_2458 = cost_2 + sel_2 * cost_458 = 624.728;
  sel_2458 = sel_2 * sel_458; rank_2458 = (1 - sel_2458) / cost_2458 = 0.0011
• merge t_2458 with t_9: cost_24589 = cost_2458 + sel_2458 * cost_9 = 658.0208;
  sel_24589 = sel_2458 * sel_9 = 0.2774;
  rank_24589 = (1 - sel_24589) / cost_24589 = 0.0011

[Figure panels: PC - Step 3: KBZ application; PC - Step 4: KBZ application]

Figure D.21: The optimization phase of RO-I by applying the KBZ algorithm.

[Figure panel: PC - Step 5: Validity post-process. Optimized Cost: 237.0848]

Figure D.22: The validity phase of RO-I that ensures that there are not precedence
constraint violations in the optimized execution plan.


[Figure panels: Acyclic PC - Step 1; Acyclic PC - Step 2; Acyclic PC - Step 3; Apply KBZ - Step 4]

Figure D.23: An application example of RO-II with the metadata of Figure [D.19].


Appendix E. Extra material about the PGreedy algorithms 

[Figure panel: PC - Post Process step. RO-II Optimized Cost: 317.3132; Optimized Cost after the post-process step: 205.5607]

Figure D.24: The post-process phase of the RO-III algorithm taking as input the generated
optimized execution plan of RO-II, as depicted in Figure [D.23].

The PGreedyI algorithm is shown in Algorithm 11. In this methodology, the 
computation of each task's cost comes in two flavours. The first one 
is similar to the cost metric in [l^, where the cost of a task is defined 
as equal to $inp_i c_i$ in each step. In this case, we add to the optimal 
partial plan the candidate task that minimizes $inp_i c_i$. In the second 
flavour, PGreedyII, the cost metric becomes $(1 - sel_i) / (inp_i c_i)$. This 
metric takes into account the selectivity of the next service to be appended 
to the execution plan and not only the selectivity of the preceding services. 
Figure E.25 analyzes an example of applying the PGreedyI algorithm based on 
the second cost metric, given the cost and selectivity values as well as the 
precedence constraints. 
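
The two cost metrics can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes a linear partial plan, takes $inp_i$ to be the product of the selectivities of the tasks already placed, and assumes that a higher value of the second metric is preferable (rank-like behaviour). The function names and the sample costs and selectivities are hypothetical.

```python
from math import prod


def input_size(partial_plan, sel):
    """Assumed definition: inp_i is the product of the selectivities of
    the tasks already placed in the (linear) partial plan."""
    return prod(sel[t] for t in partial_plan)


def metric_first(task, partial_plan, cost, sel):
    """First flavour: inp_i * c_i (the lowest value is selected)."""
    return input_size(partial_plan, sel) * cost[task]


def metric_second(task, partial_plan, cost, sel):
    """Second flavour: (1 - sel_i) / (inp_i * c_i)."""
    return (1 - sel[task]) / (input_size(partial_plan, sel) * cost[task])


# Hypothetical costs and selectivities for three tasks.
cost = {"t1": 2.0, "t2": 1.0, "t3": 4.0}
sel = {"t1": 0.5, "t2": 0.9, "t3": 0.1}
partial_plan = ["t1"]
candidates = ["t2", "t3"]

pick_first = min(
    candidates, key=lambda t: metric_first(t, partial_plan, cost, sel)
)
# Assumption: the second metric is maximized.
pick_second = max(
    candidates, key=lambda t: metric_second(t, partial_plan, cost, sel)
)
```

With these sample values the first metric prefers the cheap task t2, whereas the second metric prefers the highly selective task t3, which shows how the second flavour accounts for the selectivity of the candidate itself.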

Figure E.26 shows the evaluation results for the performance improvement 
of the PGreedy flavours. In this experiment, we compare our proposed 
PGreedy optimization algorithm with its rank-based flavour, denoted as 
PGreedyII, and we also compare each of these flavours with the 
Swap heuristic and the initial plan cost. The performance results presented 
in Figure E.26 are normalized by the cost of the initial, randomly generated 
flow execution plan. In Figure E.26 (left), PGreedy improves on the initial 
plan cost by up to 95%, whereas the execution 
cost of PGreedyII can be up to 97% lower than the initial one. In most of 
the iterations, PGreedyRank seems to be the clear winner. In the worst case, 
Algorithm 11 PGreedy 

Require: A set of n tasks, T = {t_1, ..., t_n} 
         A directed acyclic graph PC with precedence constraints 
Ensure: A directed acyclic graph P representing the optimal plan 
 1: Initialize an adjacency matrix P of optimal plan as empty 
 2: Initialize a list Cand of candidate tasks as empty 
 3: Initialize a list C of considered tasks as empty 
 4: updateCandidates(Cand, PC, C, P) 
 5: while list Cand is not empty do 
 6:     for all tasks t_j in Cand do 
 7:         V_j ← optimal value using a linear programming technique, which 
            determines the optimal cost of adding t_j in optimal plan P 
 8:         Cut_j ← optimal cut for adding t_j {cut: set of tasks that are the 
            immediate predecessors} 
 9:     end for 
10:     t_opt ← task having the least V_j 
11:     Cut_opt ← optimal cut for adding t_opt 
12:     Add t_opt task to optimal plan P with directed edges from the tasks in 
        Cut_opt to t_opt 
13:     C ← C ∪ {t_opt} 
14:     updateCandidates(Cand, PC, C, P) 
15: end while 
16: computeCost(P, costs, selectivities) 


PGreedyII improves the performance of the non-optimized plan by no less 
than 54% on average. Also, Swap in the best case improves on the initial 
flow plan by up to 89%. For 80% precedence 
constraints, as Figure E.26 (right) shows, the PGreedyII algorithm outperforms 
the other algorithms in all the data flow scenarios, even though the 
performance improvement decreases on average because of the limited possible 
reorderings. Specifically, in the best case, which is a flow with 70 tasks, 
PGreedyRank has 74% lower execution cost, while Swap improves the initial 
execution cost by 58%. 
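
The control flow of Algorithm 11 can be approximated with the following Python sketch. It is a deliberate simplification rather than the algorithm itself: the linear programming step that selects the optimal cut is replaced by appending the chosen task to the end of a linear plan, and candidates are scored with the first cost metric, $inp_i c_i$. The cost function assumes that the sum cost metric of a linear plan is the sum of $inp_i c_i$ over its tasks. All function names and the sample flow are hypothetical.

```python
from math import prod


def update_candidates(tasks, pc, considered):
    """Tasks not yet placed whose predecessors in PC are all placed."""
    return [
        t for t in tasks
        if t not in considered
        and all(p in considered for (p, s) in pc if s == t)
    ]


def greedy_linear_plan(tasks, pc, cost, sel):
    """Linear simplification of the greedy loop (no LP-based cut)."""
    plan = []
    candidates = update_candidates(tasks, pc, plan)
    while candidates:
        # Assumed input size of the next task in a linear plan.
        inp = prod(sel[t] for t in plan)
        best = min(candidates, key=lambda t: inp * cost[t])
        plan.append(best)
        candidates = update_candidates(tasks, pc, plan)
    return plan


def compute_cost(plan, cost, sel):
    """Assumed sum cost metric of a linear plan: sum of inp_i * c_i."""
    total, inp = 0.0, 1.0
    for t in plan:
        total += inp * cost[t]
        inp *= sel[t]
    return total


# Hypothetical flow: PC holds (before, after) pairs.
tasks = [1, 2, 3, 4]
pc = {(1, 2), (1, 3), (3, 4)}
cost = {1: 1.0, 2: 5.0, 3: 2.0, 4: 1.0}
sel = {1: 1.0, 2: 0.5, 3: 0.2, 4: 1.0}

plan = greedy_linear_plan(tasks, pc, cost, sel)
plan_cost = compute_cost(plan, cost, sel)
```

If the precedence constraints contained a cycle, the candidate list would become empty before all tasks were placed and the sketch would return a partial plan, so it relies on PC being acyclic, as Algorithm 11 requires.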
[Figure content: PGreedyI algorithm applied to an input data flow with n = 6 tasks and precedence constraints PC. The partial plan grows as |1|3| → |1|3|2| → |1|3|2|4| → |1|3|2|4|5|, either because a task is the only candidate or because it has the lower cost value among the candidate tasks. Result: P = |1|3|2|4|5|6| with SCM(P) = 7.9, drawn as a non-linear flow.]


Figure E.25: Example of PGreedy algorithm. 



Figure E.26: Performance improvement for data flows with n ∈ [10,100] with 40% 
precedence constraints (left) and 80% precedence constraints (right). 
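
The normalization used for the results of Figure E.26 can be made explicit with a small Python sketch. The function names and the sample costs are hypothetical and are not taken from the experiments; the sketch only assumes that each plan cost is divided by the cost of the initial plan and that the reported improvement is the relative cost reduction.

```python
def normalized_cost(optimized_cost, initial_cost):
    """Cost of the optimized plan relative to the initial plan."""
    return optimized_cost / initial_cost


def improvement_percent(optimized_cost, initial_cost):
    """Relative cost reduction over the initial plan, in percent."""
    return 100.0 * (1.0 - normalized_cost(optimized_cost, initial_cost))


# Hypothetical costs of an initial and an optimized plan.
initial = 200.0
optimized = 50.0
gain = improvement_percent(optimized, initial)
```

Under this reading, a normalized cost of 0.25 corresponds to an execution cost that is 75% lower than that of the initial plan.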